Btrfs: cut down on loops through the allocator - a5e681d9bd641c4f0677e87d3a0c92a8f4f16293

2019-03-07 Thread Alex Lyakas
Hi Josef,

This commit added the following code in find_free_extent:
ret = do_chunk_alloc(trans, root, flags,
		     CHUNK_ALLOC_FORCE);
+
+   /*
+* If we can't allocate a new chunk we've already looped
+* through at least once, move on to the NO_EMPTY_SIZE
+* case.
+*/
+   if (ret == -ENOSPC)
+   loop = LOOP_NO_EMPTY_SIZE;
+

With this, I am hitting an early ENOSPC, in the following scenario:
- assume the filesystem is almost full, with 5GB of unallocated space
left on the device
- there are multiple threads (say 6) calling find_free_extent() in
parallel with empty_size=0
- none of them finds a block group to allocate from, so they all call
do_chunk_alloc()
- five 1GB chunks are allocated, but any additional do_chunk_alloc()
call returns -ENOSPC
- As a result, such a thread moves to LOOP_NO_EMPTY_SIZE, but since
empty_size is already zero, it returns -ENOSPC to the caller. But in
fact we have 5GB of free space to allocate from. We just need to do an
extra loop.

Basically, do_chunk_alloc() returning -ENOSPC does not mean that there
is no space: parallel chunk allocations may have exhausted the device's
unallocated space, while the newly allocated chunks still have plenty
of free space.

Furthermore, this thread now sets space_info->max_extent_size, and
from then on any allocation that needs more than max_extent_size will
fail immediately. Yet if we unmount and mount again, we will have
plenty of space.
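
To make the timing concrete, below is a standalone user-space model of
this scenario (plain C with pthreads, NOT btrfs code; all names are
illustrative): six writers race through a do_chunk_alloc() stand-in on
a device with 5GB unallocated, and the loser reports -ENOSPC even
though the chunks its peers just allocated are entirely free.

/* Toy model of the early -ENOSPC; build with: gcc -pthread enospc.c */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

#define NR_WRITERS	6
#define DEVICE_GB	5

static int unallocated_gb = DEVICE_GB;	/* raw device space */
static int free_in_chunks_gb;		/* free space inside chunks */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int do_chunk_alloc(void)		/* stand-in for the real one */
{
	int ret = 0;

	pthread_mutex_lock(&lock);
	if (unallocated_gb > 0) {
		unallocated_gb--;
		free_in_chunks_gb++;	/* new 1GB chunk, fully free */
	} else {
		ret = -ENOSPC;		/* device exhausted by peers */
	}
	pthread_mutex_unlock(&lock);
	return ret;
}

static void *writer(void *arg)
{
	long id = (long)arg;
	int free_gb;

	/* the block group scan found nothing, so force a new chunk */
	if (do_chunk_alloc() == -ENOSPC) {
		/* early bail-out: loop = LOOP_NO_EMPTY_SIZE, and with
		 * empty_size == 0 the caller simply sees -ENOSPC */
		pthread_mutex_lock(&lock);
		free_gb = free_in_chunks_gb;
		pthread_mutex_unlock(&lock);
		printf("writer %ld: -ENOSPC although %d GB is free\n",
		       id, free_gb);
	} else {
		printf("writer %ld: got a chunk\n", id);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NR_WRITERS];
	long i;

	for (i = 0; i < NR_WRITERS; i++)
		pthread_create(&t[i], NULL, writer, (void *)i);
	for (i = 0; i < NR_WRITERS; i++)
		pthread_join(t[i], NULL);
	return 0;
}

With 6 writers and 5 chunks, exactly one writer fails early on every
run, despite 5GB of free space sitting in the just-allocated chunks.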

This happens to me on 4.14, but I think the mainline still has the same logic.

Thanks,
Alex.


Re: [GIT PULL] Btrfs fixes for 4.15-rc2

2019-02-18 Thread Alex Lyakas
Hi David,

>   Btrfs: incremental send, fix wrong unlink path after renaming file 
> (2017-11-28 17:15:30 +0100)
>
> 
> David Sterba (2):
>   btrfs: add missing device::flush_bio puts
Is there a reason that this one should not be tagged as "stable"? At
least the missing bio_put in free_fs_devices() is not on an error path,
so the leak would happen every time.

Thanks,
Alex.


Re: [PATCH 0/7] eb reference count cleanups

2019-02-06 Thread Alex Lyakas
Hi Nikolay,

In my kernel (4.14.x) the flag is called EXTENT_BUFFER_DUMMY, and
indeed I see that there is an extra reference drop for such buffers in
free_extent_buffer().
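
To convince myself, I wrote a toy user-space model (NOT btrfs code) of
that special case: a dummy/cloned eb with a refcount of 2 gets one
extra drop in free_extent_buffer(), so a single put still frees it and
no leak occurs.

#include <stdio.h>

struct toy_eb {
	int refs;
	int dummy;	/* stands in for EXTENT_BUFFER_DUMMY/UNMAPPED */
};

static void toy_free_extent_buffer(struct toy_eb *eb)
{
	/* the "extra dec-ref" special case for cloned/dummy buffers */
	if (eb->refs == 2 && eb->dummy)
		eb->refs--;
	if (--eb->refs == 0)
		printf("eb freed, no leak\n");
}

int main(void)
{
	/* a cloned eb: one ref from allocation, one from the clone bump */
	struct toy_eb eb = { .refs = 2, .dummy = 1 };

	toy_free_extent_buffer(&eb);	/* a single put still frees it */
	return 0;
}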

Thanks for clearing that up,
Alex.



On Wed, Feb 6, 2019 at 4:36 PM Nikolay Borisov  wrote:
>
>
>
> On 6.02.19 16:26, Alex Lyakas wrote:
> > Hi Nikolay, David,
> >
> > Isn't patch 5 (btrfs: Remove extra reference count bumps in
> > btrfs_compare_trees) fixing a memory leak, and hence should be tagged
> > as "stable"? I am specifically interested in 4.14.x.
> >
> > The comment says "remove redundant calls to extent_buffer_get since
> > they don't really add any value". But with the extra ref-count, the
> > extent buffer will not be properly freed and will cause a memory leak,
> > won't it?
>
> No, take a look at the logic of free_extent_buffer and see that there
> is special handling for cloned buffers. In fact, the series this patch
> was part of removed exactly this special handling, since it's rather
> non-intuitive, your email being a case in point.
>
> >
> > Thanks,
> > Alex.
> >
> >
> >
> > On Tue, Nov 6, 2018 at 4:30 PM David Sterba  wrote:
> >>
> >> On Wed, Aug 15, 2018 at 06:26:50PM +0300, Nikolay Borisov wrote:
> >>> Here is a series which simplifies the way eb are used in
> >>> EXTENT_BUFFER_UNMAPPED context. The end goal was to remove the special
> >>> "if we have ref count of 2 and EXTENT_BUFFER_UNMAPPED flag then act as
> >>> if this is the last ref and free the buffer" case. To enable this the
> >>> first 6 patches modify call sites which needlessly bump the reference
> >>> count.
> >>>
> >>> Patch 1 & 2 remove some btree locking when we are operating on
> >>> unmapped extent buffers. Each patch's changelog explains why this is
> >>> safe to do.
> >>>
> >>> Patch 3, 4, 5 and 6 remove redundant calls to extent_buffer_get since
> >>> they don't really add any value. In all 3 cases having a reference
> >>> count of 1 is sufficient for the eb to be freed via
> >>> btrfs_release_path.
> >>>
> >>> Patch 7 removes the special handling of the EXTENT_BUFFER_UNMAPPED
> >>> flag in free_extent_buffer. Also adjust the selftest code to account
> >>> for this change by calling free_extent_buffer one extra time. Also
> >>> document which references are being dropped. All in all this
> >>> shouldn't have any functional bearing.
> >>>
> >>> This was tested with multiple full xfstest runs as well as unloading
> >>> the btrfs module after each one to trigger the leak check and ensure
> >>> no ebs are leaked. I've also run it through btrfs' selftests multiple
> >>> times with no problems.
> >>>
> >>> With this set applied EXTENT_BUFFER_UNMAPPED seems to be relevant
> >>> only for the selftests, which leads me to believe it can be removed
> >>> altogether. I will investigate this next, but in the meantime this
> >>> series should be good to go.
> >>
> >> Besides the 8/7 patch, the rest was in for-next for a long time so I'm
> >> merging that to misc-next, targeting 4.21. I'll do one last pass
> >> through fstests with the full set and then update and push the branch,
> >> so there might be some delay before it appears in the public repo.
> >> Thanks for the cleanup.
> >


Re: [PATCH 2/2] Btrfs: fix unprotected deletion from pending_chunks list

2019-01-22 Thread Alex Lyakas
Hi Filipe,

Thank you for your response. I realize it was a long time ago, but we
are just now in the process of moving to the stable kernel 4.14.x.

Regarding the fix, I see now the relevant code in "btrfs_remove_block_group":
mutex_lock(&fs_info->chunk_mutex);
if (!list_empty(&em->list)) {
	/* We're in the transaction->pending_chunks list. */
	free_extent_map(em);
}
...
However, this raises another doubt. Let's say we indeed performed
free_extent_map() in the above code. But later we may do:
/*
 * Our em might be in trans->transaction->pending_chunks which
 * is protected by fs_info->chunk_mutex ([lock|unlock]_chunks),
 * and so is the fs_info->pinned_chunks list.
 *
 * So at this point we must be holding the chunk_mutex to avoid
 * any races with chunk allocation (more specifically at
 * volumes.c:contains_pending_extent()), to ensure it always
 * sees the em, either in the pending_chunks list or in the
 * pinned_chunks list.
 */
list_move_tail(&em->list, &fs_info->pinned_chunks);

So we have dropped the ref that was held by the
"transaction->pending_chunks" list, and now we have moved the "em" to
pinned_chunks without a ref. But the code assumes that "pinned_chunks"
also holds a ref on the "em". For example, in close_ctree we do:
while (!list_empty(&fs_info->pinned_chunks)) {
	struct extent_map *em;

	em = list_first_entry(&fs_info->pinned_chunks,
			      struct extent_map, list);
	list_del_init(&em->list);
	free_extent_map(em);
}
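
To illustrate the imbalance I suspect, here is a toy user-space model
(NOT btrfs code), assuming the invariant that each list holding the em
owns one reference, plus one reference owned by the mapping tree:

#include <stdio.h>

struct toy_em { int refs; };

static void toy_put(struct toy_em *em, const char *who)
{
	em->refs--;
	printf("%s: refs now %d%s\n", who, em->refs,
	       em->refs == 0 ? " (freed)" : "");
}

int main(void)
{
	/* one ref for the mapping tree, one for transaction->pending_chunks */
	struct toy_em em = { .refs = 2 };

	/* btrfs_remove_block_group(): drop the pending_chunks ref ... */
	toy_put(&em, "free_extent_map");
	/* ... then list_move_tail() onto pinned_chunks WITHOUT a get; */
	/* close_ctree() assumes pinned_chunks owns a ref and puts it: */
	toy_put(&em, "close_ctree");
	/* refs hit 0 here even though the tree's reference is still
	 * conceptually outstanding -- one reference short overall */
	return 0;
}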

Can you please comment on that?

Thanks,
Alex.


On Mon, Jan 21, 2019 at 10:06 PM Filipe Manana  wrote:
>
> On Mon, Jan 21, 2019 at 7:07 PM Alex Lyakas  wrote:
> >
> > Hi Filipe,
> >
> > On Tue, Dec 2, 2014 at 8:08 PM Filipe Manana  wrote:
> > >
> > > On block group remove if the corresponding extent map was on the
> > > transaction->pending_chunks list, we were deleting the extent map
> > > from that list, through remove_extent_mapping(), without any
> > > synchronization with chunk allocation (which iterates that list
> > > and adds new elements to it). Fix this by ensure that this is done
> > > while the chunk mutex is held, since that's the mutex that protects
> > > the list in the chunk allocation code path.
> > >
> > > This applies on top (depends on) of my previous patch titled:
> > > "Btrfs: fix race between fs trimming and block group remove/allocation"
> > >
> > > But the issue in fact was already present before that change, it only
> > > became easier to hit after Josef's 3.18 patch that added automatic
> > > removal of empty block groups.
> > >
> > > Signed-off-by: Filipe Manana 
> > > ---
> > >  fs/btrfs/extent-tree.c | 8 +++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 17d429d..a7b81b4 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -9524,19 +9524,25 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> > >  		list_move_tail(&em->list, &root->fs_info->pinned_chunks);
> > >  	}
> > >  	spin_unlock(&block_group->lock);
> > > -	unlock_chunks(root);
> > >
> > >  	if (remove_em) {
> > >  		struct extent_map_tree *em_tree;
> > >
> > >  		em_tree = &root->fs_info->mapping_tree.map_tree;
> > >  		write_lock(&em_tree->lock);
> > > +		/*
> > > +		 * The em might be in the pending_chunks list, so make sure the
> > > +		 * chunk mutex is locked, since remove_extent_mapping() will
> > > +		 * delete us from that list.
> > > +		 */
> > >  		remove_extent_mapping(em_tree, em);
> > >  		write_unlock(&em_tree->lock);
> > If the "em" was in pending_chunks, it will be deleted from that list
> > by "remove_extent_mapping". But it looks like in this case we also
> > need to drop the extra ref on "em", which was held by pending_chunks
> > list. I don't see it being done anywhere else. So we should check
> > before the remove_extent_mapping() call whether "em" was in

Re: [PATCH 2/2] Btrfs: fix unprotected deletion from pending_chunks list

2019-01-21 Thread Alex Lyakas
Hi Filipe,

On Tue, Dec 2, 2014 at 8:08 PM Filipe Manana  wrote:
>
> On block group remove if the corresponding extent map was on the
> transaction->pending_chunks list, we were deleting the extent map
> from that list, through remove_extent_mapping(), without any
> synchronization with chunk allocation (which iterates that list
> and adds new elements to it). Fix this by ensure that this is done
> while the chunk mutex is held, since that's the mutex that protects
> the list in the chunk allocation code path.
>
> This applies on top (depends on) of my previous patch titled:
> "Btrfs: fix race between fs trimming and block group remove/allocation"
>
> But the issue in fact was already present before that change, it only
> became easier to hit after Josef's 3.18 patch that added automatic
> removal of empty block groups.
>
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/extent-tree.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 17d429d..a7b81b4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9524,19 +9524,25 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  		list_move_tail(&em->list, &root->fs_info->pinned_chunks);
>  	}
>  	spin_unlock(&block_group->lock);
> -	unlock_chunks(root);
>
>  	if (remove_em) {
>  		struct extent_map_tree *em_tree;
>
>  		em_tree = &root->fs_info->mapping_tree.map_tree;
>  		write_lock(&em_tree->lock);
> +		/*
> +		 * The em might be in the pending_chunks list, so make sure the
> +		 * chunk mutex is locked, since remove_extent_mapping() will
> +		 * delete us from that list.
> +		 */
>  		remove_extent_mapping(em_tree, em);
>  		write_unlock(&em_tree->lock);
If the "em" was in pending_chunks, it will be deleted from that list
by "remove_extent_mapping". But it looks like in this case we also
need to drop the extra ref on "em", which was held by pending_chunks
list. I don't see it being done anywhere else. So we should check
before the remove_extent_mapping() call whether "em" was in
pending_chunks, and, if yes, drop the extra ref?

Thanks,
Alex.


>  		/* once for the tree */
>  		free_extent_map(em);
>  	}
>
> +	unlock_chunks(root);
> +
>  	btrfs_put_block_group(block_group);
>  	btrfs_put_block_group(block_group);
>
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-05-13 Thread Alex Lyakas
Hi Liu,

On Wed, Mar 22, 2017 at 1:40 AM, Liu Bo  wrote:
> On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
>> We have a commit_root_sem, which is a read-write semaphore that protects the
>> commit roots.
>> But it is also used to protect the list of caching block groups.
>>
>> As a result, while doing "slow" caching, the following issue is seen:
>>
>> Some of the caching threads are scanning the extent tree with
>> commit_root_sem
>> acquired in shared mode, with stack like:
>> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
>> [] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0
>> [btrfs]
>> [] read_tree_block+0x40/0x70 [btrfs]
>> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
>> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
>> [] caching_thread+0x1b9/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>>
>> IO requests that want to allocate space are waiting in cache_block_group()
>> to acquire the commit_root_sem in exclusive mode. But they only want to add
>> the caching control structure to the list of caching block-groups:
>> [] schedule+0x29/0x70
>> [] rwsem_down_write_failed+0x145/0x320
>> [] call_rwsem_down_write_failed+0x13/0x20
>> [] cache_block_group+0x25b/0x450 [btrfs]
>> [] find_free_extent+0xd16/0xdb0 [btrfs]
>> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
>>
>> Other caching threads want to continue their scanning, and for that they
>> are waiting to acquire commit_root_sem in shared mode. But since there are
>> IO threads that want the exclusive lock, the caching threads are unable
>> to continue the scanning, because (I presume) rw_semaphore guarantees some
>> fairness:
>> [] schedule+0x29/0x70
>> [] rwsem_down_read_failed+0xc5/0x120
>> [] call_rwsem_down_read_failed+0x14/0x30
>> [] caching_thread+0x1a1/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>> [] process_one_work+0x146/0x410
>>
>> This causes slowness of the IO, especially when there are many block groups
>> that need to be scanned for free space. In some cases it takes minutes
>> until a single IO thread is able to allocate free space.
>>
>> I don't see a deadlock here, because the caching threads that were
>> able to acquire the commit_root_sem will call rwsem_is_contended()
>> and should give up the semaphore, so that IO threads are able to
>> acquire it in exclusive mode.
>>
>> However, introducing a separate mutex that protects only the list of caching
>> block groups makes things move forward much faster.
>>
>
> The problem did exist and the patch looks good to me.
>
>> This patch is based on kernel 3.18.
>> Unfortunately, I am not able to submit a patch based on one of the
>> latest kernels, because here btrfs is part of a larger system, and
>> upgrading the kernel is a significant effort.
>> Hence marking the patch as RFC.
>> Hopefully, this patch still has some value to the community.
>>
>> Signed-off-by: Alex Lyakas 
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 42d11e7..74feacb 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>>  	struct list_head trans_list;
>>  	struct list_head dead_roots;
>>  	struct list_head caching_block_groups;
>> +	/* protects the above list */
>> +	struct mutex caching_block_groups_mutex;
>>
>>  	spinlock_t delayed_iput_lock;
>>  	struct list_head delayed_iputs;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 5177954..130ec58 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
>>  	INIT_LIST_HEAD(&fs_info->delayed_iputs);
>>  	INIT_LIST_HEAD(&fs_info->delalloc_roots);
>>  	INIT_LIST_HEAD(&fs_info->caching_block_groups);
>> +	mutex_init(&fs_info->caching_block_groups_mutex);
>>  	spin_lock_init(&fs_info->delalloc_root_lock);
>>  	spin_lock_init(&fs_info->trans_lock);
>>  	spin_lock_init(&fs_info->fs_roots_radix_lock);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a067065..906fb08 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>>  		return 0;
>>  	}
>>
>> -

Re: include linux kernel headers for btrfs filesystem

2017-03-20 Thread Alex Lyakas
Ilan,

On Mon, Mar 20, 2017 at 10:33 AM, Ilan Schwarts  wrote:
> I need to cast struct inode to struct btrfs_inode.
> In order to do it, I looked at the implementation of btrfs_getattr.
>
> The code is simple:
> struct btrfs_inode *btrfsInode;
> btrfsInode = BTRFS_I(inode);
>
> In order to compile, I must add the headers on top of the function:
> #include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/ctree.h"
> #include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/btrfs_inode.h"
>
> What is the problem?
> I must manually download and include ctree.h and btrfs_inode.h; they
> are not provided in the kernel-headers package.
> On every platform I compile my driver on, I have a specific VM for the
> distro/kernel version, so on every VM I usually install the
> kernel-headers package and everything compiles perfectly.
>
> btrfs was introduced in kernel 3.0 and above.
> Shouldn't the btrfs headers be there? Do they exist in another
> package, maybe fs-headers or something like that?

Try using the simple Makefile[1] below to compile the btrfs loadable
module. You need to have the kernel-headers package installed.
You can place the Makefile anywhere you want, and compile via:
# make -f 

Thanks,
Alex.


[1]
obj-m += btrfs.o

# or substitute with hard-coded kernel version
KVERSION = $(shell uname -r)

SRC_DIR=/fs/btrfs
BTRFS_KO=btrfs.ko

# or specify any other output directory
OUT_DIR=/lib/modules/$(KVERSION)/kernel/fs/btrfs

all: $(OUT_DIR)/$(BTRFS_KO)

$(OUT_DIR)/$(BTRFS_KO): $(SRC_DIR)/$(BTRFS_KO)
	cp $(SRC_DIR)/$(BTRFS_KO) $(OUT_DIR)/

$(SRC_DIR)/$(BTRFS_KO): $(SRC_DIR)/*.c $(SRC_DIR)/*.h
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(SRC_DIR) modules

clean:
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(SRC_DIR) clean
	test -f $(OUT_DIR)/$(BTRFS_KO) && rm $(OUT_DIR)/$(BTRFS_KO) || true


> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-03-19 Thread Alex Lyakas
We have a commit_root_sem, which is a read-write semaphore that
protects the commit roots. But it is also used to protect the list of
caching block groups.

As a result, while doing "slow" caching, the following issue is seen:

Some of the caching threads are scanning the extent tree with
commit_root_sem acquired in shared mode, with a stack like:
[] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
[] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
[] read_tree_block+0x40/0x70 [btrfs]
[] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
[] btrfs_search_slot+0x3c6/0xb10 [btrfs]
[] caching_thread+0x1b9/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]

IO requests that want to allocate space are waiting in
cache_block_group() to acquire the commit_root_sem in exclusive mode.
But they only want to add the caching control structure to the list of
caching block-groups:
[] schedule+0x29/0x70
[] rwsem_down_write_failed+0x145/0x320
[] call_rwsem_down_write_failed+0x13/0x20
[] cache_block_group+0x25b/0x450 [btrfs]
[] find_free_extent+0xd16/0xdb0 [btrfs]
[] btrfs_reserve_extent+0xaf/0x160 [btrfs]

Other caching threads want to continue their scanning, and for that
they are waiting to acquire commit_root_sem in shared mode. But since
there are IO threads that want the exclusive lock, the caching threads
are unable to continue the scanning, because (I presume) rw_semaphore
guarantees some fairness:
[] schedule+0x29/0x70
[] rwsem_down_read_failed+0xc5/0x120
[] call_rwsem_down_read_failed+0x14/0x30
[] caching_thread+0x1a1/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]
[] process_one_work+0x146/0x410

This causes slowness of the IO, especially when there are many block
groups that need to be scanned for free space. In some cases it takes
minutes until a single IO thread is able to allocate free space.

I don't see a deadlock here, because the caching threads that were
able to acquire the commit_root_sem will call rwsem_is_contended() and
should give up the semaphore, so that IO threads are able to acquire
it in exclusive mode.

However, introducing a separate mutex that protects only the list of
caching block groups makes things move forward much faster.

This patch is based on kernel 3.18. Unfortunately, I am not able to
submit a patch based on one of the latest kernels, because here btrfs
is part of a larger system, and upgrading the kernel is a significant
effort. Hence marking the patch as RFC. Hopefully, this patch still
has some value to the community.

Signed-off-by: Alex Lyakas 

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 42d11e7..74feacb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
 	struct list_head trans_list;
 	struct list_head dead_roots;
 	struct list_head caching_block_groups;
+	/* protects the above list */
+	struct mutex caching_block_groups_mutex;
 
 	spinlock_t delayed_iput_lock;
 	struct list_head delayed_iputs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5177954..130ec58 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
 	INIT_LIST_HEAD(&fs_info->delayed_iputs);
 	INIT_LIST_HEAD(&fs_info->delalloc_roots);
 	INIT_LIST_HEAD(&fs_info->caching_block_groups);
+	mutex_init(&fs_info->caching_block_groups_mutex);
 	spin_lock_init(&fs_info->delalloc_root_lock);
 	spin_lock_init(&fs_info->trans_lock);
 	spin_lock_init(&fs_info->fs_roots_radix_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a067065..906fb08 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 		return 0;
 	}
 
-	down_write(&fs_info->commit_root_sem);
+	mutex_lock(&fs_info->caching_block_groups_mutex);
 	atomic_inc(&caching_ctl->count);
 	list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
-	up_write(&fs_info->commit_root_sem);
+	mutex_unlock(&fs_info->caching_block_groups_mutex);
 
 	btrfs_get_block_group(cache);
 
@@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
 
 	down_write(&fs_info->commit_root_sem);
 
+	mutex_lock(&fs_info->caching_block_groups_mutex);
 	list_for_each_entry_safe(caching_ctl, next,
 				 &fs_info->caching_block_groups, list) {
 		cache = caching_ctl->block_group;
@@ -5704,6 +5705,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
 
 			cache->last_byte_to_unpin = caching_ctl->progress;
 		}
 	}
+	mutex_unlock(&fs_info->caching_block_groups_mutex);
 
 	if (fs_info->pinned_extents == &fs_info->freed_extent

Re: [PATCH] Btrfs: deal with unexpected return value in flush_space

2016-10-01 Thread Alex Lyakas
David, Holger,

Thank you for picking up that old patch of mine.

Alex.


On Fri, Jul 29, 2016 at 8:53 PM, Liu Bo  wrote:
> On Fri, Jul 29, 2016 at 07:01:50PM +0200, David Sterba wrote:
>> On Thu, Jul 28, 2016 at 11:49:14AM -0700, Liu Bo wrote:
>> > > For reviewers - this came up before here:
>> > > https://patchwork.kernel.org/patch/7778651/
>
> David, this patch made a mistake in commit log.
>
>> > >
>> > > Same fix basically.
>> >
>> > Aha, I've given it my Reviewed-by.
>> >
>> > Taking either one works for me, I can make the clarifying comment into a
>> > separate patch if we need to.
>>
>> I'll pick the first patch and please send the separate comment update.
>> Thanks.
>
> Sure.
>
> Thanks,
>
> -liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC - PATCH] btrfs: do not ignore errors from primary superblock

2016-05-17 Thread Alex Lyakas

RFC: This patch is not for merging, but only for review and discussion.

When mounting, we consider only the primary superblock on each device.
But when writing the superblocks, we might silently ignore errors
from the primary superblock, if we succeeded to write the secondary
superblocks. In such a case, the primary superblock was not updated
properly, and if we crash at this point, a later mount will use
an out-of-date superblock.

This patch changes the behavior to NOT IGNORE any errors on the
primary superblock, and to IGNORE any errors on secondary superblocks.
This way, we always insist on having an up-to-date primary superblock.
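
For clarity, below is a minimal illustration of the policy change
(plain C, NOT kernel code); "copies" is how many superblock copies
were attempted and i == 0 is the primary, as in write_dev_supers().

#include <stdio.h>

/* old: succeed as long as at least one copy (any copy) was written */
static int old_policy(int errors, int copies)
{
	return errors < copies ? 0 : -1;
}

/* new: fail if and only if the primary copy had an error */
static int new_policy(int primary_errors)
{
	return primary_errors ? -1 : 0;
}

int main(void)
{
	/* primary failed, two secondaries fine: the old check reports
	 * success, so a crash now leaves an out-of-date primary that a
	 * later mount will happily use */
	printf("old=%d new=%d\n", old_policy(1, 3), new_policy(1));
	return 0;
}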

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4e47849..0ae9f7c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3357,11 +3357,13 @@ static int write_dev_supers(struct btrfs_device *device,
 
 		bh = __find_get_block(device->bdev, bytenr / 4096,
 				      BTRFS_SUPER_INFO_SIZE);
 		if (!bh) {
-			errors++;
+			/* we only care about primary superblock errors */
+			if (i == 0)
+				errors++;
 			continue;
 		}
 		wait_on_buffer(bh);
-		if (!buffer_uptodate(bh))
+		if (!buffer_uptodate(bh) && i == 0)
 			errors++;
 
 		/* drop our reference */
@@ -3388,9 +3390,10 @@ static int write_dev_supers(struct btrfs_device *device,
 
 			      BTRFS_SUPER_INFO_SIZE);
 		if (!bh) {
 			btrfs_err(device->dev_root->fs_info,
-				"couldn't get super buffer head for bytenr %llu",
-				bytenr);
-			errors++;
+				"couldn't get super buffer head for bytenr %llu (sb copy %d)",
+				bytenr, i);
+			if (i == 0)
+				errors++;
 			continue;
 		}
 
@@ -3413,10 +3416,10 @@ static int write_dev_supers(struct btrfs_device *device,
 
 			ret = btrfsic_submit_bh(WRITE_FUA, bh);
 		else
 			ret = btrfsic_submit_bh(WRITE_SYNC, bh);
-		if (ret)
+		if (ret && i == 0)
 			errors++;
 	}
-	return errors < i ? 0 : -1;
+	return errors ? -1 : 0;
 }

/*


P.S.: when reviewing the code of write_dev_supers(), I also noticed that 
when wait==0 and we hit an error in one __getblk(), then the caller 
(write_all_supers) will not properly wait for submitted buffer-heads to 
complete, and we won't do the additional "brelse(bh);", which wait==0 case 
does. Is this a problem?



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 6/9] Btrfs: implement the free space B-tree

2016-04-22 Thread Alex Lyakas
Hi Omar, Chris,

I have reviewed the free-space-tree code. It is a very nice feature.

However, I have a basic understanding question.

Let's say we are running a delayed ref which inserts a new EXTENT_ITEM
into the extent tree, e.g., we are in alloc_reserved_file_extent. At
this point we call remove_from_free_space_tree(), which updates the
free-space-tree about the allocated space. But this requires to COW
the free-space-tree itself. So we allocate a new tree block for the
free-space tree, and insert a new delayed ref, which will update the
extent tree about the new tree block allocation. We also insert a
delayed ref to free the previous copy of the free-space-tree block.

At some point we run these new delayed refs, so we insert/remove
EXTENT_ITEMs from the extent tree, and this in turn requires us to
update the free-space-tree again. So we need again to COW
free-space-tree blocks, generating more delayed refs.

At what point does this recursion stop?

Do we assume that at some point all needed free-space tree blocks have
been COW'ed already, and we do not COW a tree block more than once per
transaction (unless it was written to disk due to memory pressure)?
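
Here is a toy user-space sketch (NOT btrfs code) of why the recursion
would converge under that assumption: once a block's generation
matches the running transaction, further updates rewrite it in place
and generate no new delayed refs, so a fixpoint is reached.

#include <stdio.h>

#define NR_BLOCKS 8

int main(void)
{
	unsigned long transid = 100;
	unsigned long block_gen[NR_BLOCKS] = { 0 };	/* stale generations */
	int round = 0, new_refs;

	do {
		int i;

		new_refs = 0;
		for (i = 0; i < NR_BLOCKS; i++) {
			/* a free-space-tree update touches this block */
			if (block_gen[i] != transid) {
				block_gen[i] = transid;	/* COWed once */
				new_refs++;	/* delayed refs: alloc + free */
			}
			/* else: modified in place, no new delayed refs */
		}
		printf("round %d: %d blocks COWed\n", ++round, new_refs);
	} while (new_refs > 0);	/* stops once every block is at transid */
	return 0;
}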

Thanks!
Alex.


On Tue, Dec 29, 2015 at 11:19 PM, Chris Mason  wrote:
> On Tue, Sep 29, 2015 at 08:50:35PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval 
>>
>> The free space cache has turned out to be a scalability bottleneck on
>> large, busy filesystems. When the cache for a lot of block groups needs
>> to be written out, we can get extremely long commit times; if this
>> happens in the critical section, things are especially bad because we
>> block new transactions from happening.
>>
>> The main problem with the free space cache is that it has to be written
>> out in its entirety and is managed in an ad hoc fashion. Using a B-tree
>> to store free space fixes this: updates can be done as needed and we get
>> all of the benefits of using a B-tree: checksumming, RAID handling,
>> well-understood behavior.
>>
>> With the free space tree, we get commit times that are about the same as
>> the no cache case with load times slower than the free space cache case
>> but still much faster than the no cache case. Free space is represented
>> with extents until it becomes more space-efficient to use bitmaps,
>> giving us similar space overhead to the free space cache.
>>
>> The operations on the free space tree are: adding and removing free
>> space, handling the creation and deletion of block groups, and loading
>> the free space for a block group. We can also create the free space tree
>> by walking the extent tree and clear the free space tree.
>>
>> Signed-off-by: Omar Sandoval 
>
>> +int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
>> +{
>> + struct btrfs_trans_handle *trans;
>> + struct btrfs_root *tree_root = fs_info->tree_root;
>> + struct btrfs_root *free_space_root;
>> + struct btrfs_block_group_cache *block_group;
>> + struct rb_node *node;
>> + int ret;
>> +
>> + trans = btrfs_start_transaction(tree_root, 0);
>> + if (IS_ERR(trans))
>> + return PTR_ERR(trans);
>> +
>> + free_space_root = btrfs_create_tree(trans, fs_info,
>> + BTRFS_FREE_SPACE_TREE_OBJECTID);
>> + if (IS_ERR(free_space_root)) {
>> + ret = PTR_ERR(free_space_root);
>> + goto abort;
>> + }
>> + fs_info->free_space_root = free_space_root;
>> +
>> + node = rb_first(&fs_info->block_group_cache_tree);
>> + while (node) {
>> + block_group = rb_entry(node, struct btrfs_block_group_cache,
>> +cache_node);
>> + ret = populate_free_space_tree(trans, fs_info, block_group);
>> + if (ret)
>> + goto abort;
>> + node = rb_next(node);
>> + }
>> +
>> + btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
>> +
>> + ret = btrfs_commit_transaction(trans, tree_root);
>> + if (ret)
>> + return ret;
>> +
>> + return 0;
>> +
>> +abort:
>> + btrfs_abort_transaction(trans, tree_root, ret);
>> + btrfs_end_transaction(trans, tree_root);
>> + return ret;
>> +}
>> +
>
> Hi Omar,
>
> The only problem I've hit testing this stuff is where we create the tree
> on existing filesystems.  There are a few different problems here:
>
> 1) The populate code happens after resuming balance operations.  The
> balancing code could be changing these block groups while we scan them.
> I fixed this by moving the scan up earlier.
>
> 2) Delayed references may be run, which will also change the extent tree
> as we're scanning it.
>
> 3) We might need to commit the transaction to reclaim space.
>
> For now I'm ignoring #3 and adding a flag in fs_info that will make us
> skip delayed references.  This really isn't a good long term solution,
> we need to be able to do this on a per block group basis and make
> forward progress without pinning 

Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-04-03 Thread Alex Lyakas
Hello Qu, Wang,

On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo  wrote:
>
>
> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>
>> Greetings Qu Wenruo,
>>
>> I have reviewed the dedup patchset found in the github account you
>> mentioned. I have several questions. Please note that by all means I
>> am not criticizing your design or code. I just want to make sure that
>> my understanding of the code is proper.
>
>
> It's OK to criticize the design or code, and that's how review works.
>
>>
>> 1) You mentioned in several emails that at some point byte-to-byte
>> comparison is to be performed. However, I do not see this in the code.
>> It seems that generic_search() only looks for the hash value match. If
>> there is a match, it goes ahead and adds a delayed ref.
>
>
> I mentioned byte-to-byte comparison as, "not to be implemented in any time
> soon".
>
> Considering the lack of facility to read out extent contents without any
> inode structure, it's not going to be done in any time soon.
>
>>
>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>> mutex and proceed with the normal COW. What happens if there are
>> several IO streams to different files writing an identical block, but
>> we don't have such block in our dedup DB? Then all
>> btrfs_dedupe_search() calls will not find a match, so all streams will
>> allocate space for their block (which are all identical). At some
>> point, they will call insert_reserved_file_extent() and will call
>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>> will insert the dedup hash entries into the DB, and all other streams
>> will find that such hash entry already exists. So the end result is
>> that we have the hash entry in the DB, but still we have multiple
>> copies of the same block allocated, due to timing issues. Is this
>> correct?
>
>
> That's right, and that's also unavoidable for the hash initializing stage.
>
>>
>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>> generic_search() wants to add a delayed ref to an existing extent,
>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>> DB. The race is resolved as follows:
>> - generic_search attempts to lock the delayed ref head
>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>> right now. So we can add a delayed ref. Later, when delayed ref head
>> will be run, it will figure out what needs to be done (free the extent
>> or not)
>> - if we fail to lock, then there is a delayed ref processing for this
>> bytenr. We drop all locks and redo the search from the top. If
>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>> not find it, and proceed with normal COW.
>> Is my understanding correct?
>
>
> Yes that's correct.

Reviewing the code again, it seems that I still lack understanding.
What is special about the dedup code adding a delayed data ref versus
other places doing that? In other places, we do not insist on locking
the delayed ref head, but in dedup we do. For example,
__btrfs_drop_extents calls btrfs_inc_extent_ref without locking the
ref head. I know that one of your purposes was to draw attention to
delayed ref processing, so you have succeeded.

Thanks,
Alex.




>
>>
>> I also have a few nitpicks on the code; I will reply to the relevant patches.
>
>
> Feel free to comment.
>
> Thanks,
> Qu
>
>>
>> Thanks for doing this work,
>> Alex.
>>
>>
>>
>> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo 
>> wrote:
>>>
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>>
>>> This updated version of inband de-duplication has the following features:
>>> 1) ONE unified dedup framework.
>>> Most of its code is hidden quietly in dedup.c and export the minimal
>>> interfaces for its caller.
>>> Reviewer and further developer would benefit from the unified
>>> framework.
>>>
>>> 2) TWO different back-end with different trade-off
>>> One is the improved version of previous Fujitsu in-memory only dedup.
>>> The other one is enhanced dedup implementation from Liu Bo.
>>> Changed its tree structure to handle bytenr -> hash search for
>>> deleting hash, without the hideous data backref hack.
>>>
>>> 3) Support compression with dedupe
>>> Now dedupe can work with compression.
>>> Means that,

Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-03-30 Thread Alex Lyakas
Thanks for your comments, Qu.

Alex.


Re: [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info

2016-03-29 Thread Alex Lyakas
Hi Qu, Wang,

On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo  wrote:
> Since we will introduce a new on-disk based dedupe method, introduce new
> interfaces to resume previous dedupe setup.
>
> And since we introduce a new tree for status, also add disable handler
> for it.
>
> Signed-off-by: Wang Xiaoguang 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/dedupe.c  | 269 +
>  fs/btrfs/dedupe.h  |  13 +++
>  fs/btrfs/disk-io.c |  21 -
>  fs/btrfs/disk-io.h |   1 +
>  4 files changed, 283 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index 7ef2c37..1112fec 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -21,6 +21,8 @@
>  #include "transaction.h"
>  #include "delayed-ref.h"
>  #include "qgroup.h"
> +#include "disk-io.h"
> +#include "locking.h"
>
>  struct inmem_hash {
> struct rb_node hash_node;
> @@ -41,10 +43,103 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 
> type)
> GFP_NOFS);
>  }
>
> +static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
> +   u16 backend, u64 blocksize, u64 limit)
> +{
> +   struct btrfs_dedupe_info *dedupe_info;
> +
> +   dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
> +   if (!dedupe_info)
> +   return -ENOMEM;
> +
> +   dedupe_info->hash_type = type;
> +   dedupe_info->backend = backend;
> +   dedupe_info->blocksize = blocksize;
> +   dedupe_info->limit_nr = limit;
> +
> +   /* only support SHA256 yet */
> +   dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
> +   if (IS_ERR(dedupe_info->dedupe_driver)) {
> +   int ret;
> +
> +   ret = PTR_ERR(dedupe_info->dedupe_driver);
> +   kfree(dedupe_info);
> +   return ret;
> +   }
> +
> +   dedupe_info->hash_root = RB_ROOT;
> +   dedupe_info->bytenr_root = RB_ROOT;
> +   dedupe_info->current_nr = 0;
> +   INIT_LIST_HEAD(&dedupe_info->lru_list);
> +   mutex_init(&dedupe_info->lock);
> +
> +   *ret_info = dedupe_info;
> +   return 0;
> +}
> +
> +static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
> +   struct btrfs_dedupe_info *dedupe_info)
> +{
> +   struct btrfs_root *dedupe_root;
> +   struct btrfs_key key;
> +   struct btrfs_path *path;
> +   struct btrfs_dedupe_status_item *status;
> +   struct btrfs_trans_handle *trans;
> +   int ret;
> +
> +   path = btrfs_alloc_path();
> +   if (!path)
> +   return -ENOMEM;
> +
> +   trans = btrfs_start_transaction(fs_info->tree_root, 2);
> +   if (IS_ERR(trans)) {
> +   ret = PTR_ERR(trans);
> +   goto out;
> +   }
> +   dedupe_root = btrfs_create_tree(trans, fs_info,
> +  BTRFS_DEDUPE_TREE_OBJECTID);
> +   if (IS_ERR(dedupe_root)) {
> +   ret = PTR_ERR(dedupe_root);
> +   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
> +   goto out;
> +   }
> +   dedupe_info->dedupe_root = dedupe_root;
> +
> +   key.objectid = 0;
> +   key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
> +   key.offset = 0;
> +
> +   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
> + sizeof(*status));
> +   if (ret < 0) {
> +   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
> +   goto out;
> +   }
> +
> +   status = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +   struct btrfs_dedupe_status_item);
> +   btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
> +dedupe_info->blocksize);
> +   btrfs_set_dedupe_status_limit(path->nodes[0], status,
> +   dedupe_info->limit_nr);
> +   btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
> +   dedupe_info->hash_type);
> +   btrfs_set_dedupe_status_backend(path->nodes[0], status,
> +   dedupe_info->backend);
> +   btrfs_mark_buffer_dirty(path->nodes[0]);
> +out:
> +   btrfs_free_path(path);
> +   if (ret == 0)
> +   btrfs_commit_transaction(trans, fs_info->tree_root);
> +   return ret;
> +}
> +
>  int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
> u64 blocksize, u64 limit_nr)
>  {
> struct btrfs_dedupe_info *dedupe_info;
> +   int create_tree;
> +   u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
> u64 limit = limit_nr;
> int ret = 0;
>
> @@ -63,6 +158,14 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, 
> u16 type, u16 backend,
> limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
> if (backend == BTRFS_DEDUPE_BACKEND_ONDISK &

Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-03-29 Thread Alex Lyakas
Greetings Qu Wenruo,

I have reviewed the dedup patchset found in the github account you
mentioned. I have several questions. Please note that by all means I
am not criticizing your design or code. I just want to make sure that
my understanding of the code is proper.

1) You mentioned in several emails that at some point byte-to-byte
comparison is to be performed. However, I do not see this in the code.
It seems that generic_search() only looks for the hash value match. If
there is a match, it goes ahead and adds a delayed ref.

2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
mutex and proceed with the normal COW. What happens if there are
several IO streams to different files writing an identical block, but
we don't have such a block in our dedup DB? Then all
btrfs_dedupe_search() calls will not find a match, so all streams will
allocate space for their blocks (which are all identical). At some
point, they will call insert_reserved_file_extent() and will call
btrfs_dedupe_add(). Since there is a global mutex, the first stream
will insert the dedup hash entries into the DB, and all other streams
will find that such a hash entry already exists. So the end result is
that we have the hash entry in the DB, but still we have multiple
copies of the same block allocated, due to timing issues. Is this
correct?

3) generic_search() competes with __btrfs_free_extent(). Meaning that
generic_search() wants to add a delayed ref to an existing extent,
whereas __btrfs_free_extent() wants to delete an entry from the dedup
DB. The race is resolved as follows:
- generic_search attempts to lock the delayed ref head
- if it succeeds to lock, then __btrfs_free_extent() is not running
right now. So we can add a delayed ref. Later, when the delayed ref
head is run, it will figure out what needs to be done (free the extent
or not)
- if we fail to lock, then delayed ref processing is in progress for
this bytenr. We drop all locks and redo the search from the top. If
__btrfs_free_extent() has deleted the dedup hash meanwhile, we will
not find it, and proceed with normal COW.
Is my understanding correct?
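
For reference, here is a toy user-space sketch (NOT btrfs code) of the
lock-or-retry scheme in (3): pthread_mutex_trylock() stands in for
locking the delayed ref head, and a flag stands in for the dedup DB
lookup.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ref_head_lock = PTHREAD_MUTEX_INITIALIZER;
static int hash_present = 1;	/* cleared by the free-extent side */

/* returns 1 if we deduped, 0 if the caller must fall back to COW */
static int toy_generic_search(void)
{
	for (;;) {
		if (!hash_present)
			return 0;	/* hash gone: normal COW path */
		if (pthread_mutex_trylock(&ref_head_lock) == 0) {
			/* no __btrfs_free_extent() running: safe to
			 * add a delayed ref against the found extent */
			pthread_mutex_unlock(&ref_head_lock);
			return 1;
		}
		/* ref head busy: drop everything and redo the search
		 * from the top; the hash may be deleted meanwhile */
	}
}

int main(void)
{
	printf("dedupe hit: %d\n", toy_generic_search());
	return 0;
}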

I also have a few nitpicks on the code; I will reply to the relevant patches.

Thanks for doing this work,
Alex.



On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo  wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160322
>
> This updated version of inband de-duplication has the following features:
> 1) ONE unified dedup framework.
>Most of its code is hidden quietly in dedup.c and export the minimal
>interfaces for its caller.
>Reviewer and further developer would benefit from the unified
>framework.
>
> 2) TWO different back-end with different trade-off
>One is the improved version of previous Fujitsu in-memory only dedup.
>The other one is enhanced dedup implementation from Liu Bo.
>Changed its tree structure to handle bytenr -> hash search for
>deleting hash, without the hideous data backref hack.
>
> 3) Support compression with dedupe
>Now dedupe can work with compression.
>Means that, a dedupe miss case can be compressed, and dedupe hit case
>can also reuse compressed file extents.
>
> 4) Ioctl interface with persist dedup status
>Advised by David, now we use ioctl to enable/disable dedup.
>
>And we now have dedup status, recorded in the first item of dedup
>tree.
>Just like quota, once enabled, no extra ioctl is needed for next
>mount.
>
> 5) Ability to disable dedup for given dirs/files
>It works just like the compression prop method, by adding a new
>xattr.
>
> TODO:
> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>CPU may even be a bottleneck other than IO.
>But for faster hash, it will definitely cause conflicts, so we need
>extent comparison before we introduce new dedup algorithm.
>
> 2) Misc end-user related helpers
>Like handy and easy to implement dedup rate report.
>And method to query in-memory hash size for those "non-exist" users who
>want to use 'dedup enable -l' option but didn't ever know how much
>RAM they have.
>
> Changelog:
> v2:
>   Totally reworked to handle multiple backends
> v3:
>   Fix a stupid but deadly on-disk backend bug
>   Add handle for multiple hash on same bytenr corner case to fix abort
>   trans error
>   Increase dedup rate by enhancing delayed ref handler for both backend.
>   Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>   Increase dedup block size up limit to 8M.
> v4:
>   Add dedup prop for disabling dedup for given files/dirs.
>   Merge inmem_search() and ondisk_search() into generic_search() to save
>   some code
>   Fix another delayed_ref related bug.
>   Use the same mutex for both inmem and ondisk backend.
>   Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
>   rate.
> v5

Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk

2016-03-13 Thread Alex Lyakas
Nicholas,

On Sat, Mar 12, 2016 at 12:19 AM, Nicholas D Steeves  wrote:
> On 10 March 2016 at 06:10, Alex Lyakas  wrote:
>> csum_dirty_buffer was issuing a warning in case the extent buffer
>> did not look alright, but was still returning success.
>> Let's return error in this case, and also add an additional sanity
>> check on the extent buffer header.
>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio 
>> will,
>> but it is better than to have a silent metadata corruption on disk.
>
> Does this mean there is a good chance that everyone has corrupted
> metadata?
No, this definitely does not.

The code that I added prevents btrfs from writing a metadata block if
it somehow got corrupted before being sent to disk. If that happens, it
indicates a bug somewhere in the kernel, for example, some other
kernel module erroneously using a page-cache page which does not
belong to it (and which contains a btrfs metadata block or part of one).

> Is there any way to verify/rebuild it without wipefs+mkfs+restore from 
> backups?
To verify btrfs metadata: unmount the filesystem and run "btrfs check
...". Do not specify the "repair" parameter. Another way to verify is
to run "btrfs-debug-tree" and redirect its standard output to
/dev/null. It should not print anything to standard error. But "btrfs
check" is faster.

Thanks,
Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk

2016-03-10 Thread Alex Lyakas
Hello Filipe,

I have sent two patches addressing this issue.

When testing, I discovered that log tree blocks can sometimes carry a
chunk tree UUID which is all zeros! Does this make sense? You can take
a look at a small debug-tree output demonstrating this phenomenon at
https://drive.google.com/file/d/0B9rmyUifdvMLbHBuSWU5dlVKNWc. Due to
this I did not include the chunk tree UUID check, hoping very much
that the fs UUID is always valid for all tree blocks.

Thanks,
Alex.



On Mon, Feb 22, 2016 at 12:28 PM, Filipe Manana  wrote:
> On Mon, Feb 22, 2016 at 9:46 AM, Alex Lyakas  wrote:
>> Thank you, Filipe, for your review.
>>
>> On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana  wrote:
>>> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas  wrote:
>>>> csum_dirty_buffer was issuing a warning in case the extent buffer
>>>> did not look alright, but was still returning success.
>>>> Let's return error in this case, and also add two additional sanity
>>>> checks on the extent buffer header.
>>>>
>>>> We had btrfs metadata corruption, and after looking at the logs we saw
>>>> that WARN_ON(found_start != start) has been triggered. We are still
>>>> investigating
>>>
>>> There's a warning for WARN_ON(found_start != start || !PageUptodate(page))
>>>
>>> Are you sure it triggered only because of found_start != start and not
>>> because of !PageUptodate(page) (or both)?
>> The problem initially happened on kernel 3.8.13.  In this kernel, the
>> code looks like this:
>>  found_start = btrfs_header_bytenr(eb);
>>  if (found_start != start) {
>>          WARN_ON(1);
>>          return 0;
>>  }
>>  if (!PageUptodate(page)) {
>>          WARN_ON(1);
>>          return 0;
>>  }
>> (You can see it on
>> http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420)
>> The WARN_ON that we hit was on the found_start comparison.
>
> Ok, I see now that one of those useless cleanup patches merged both
> conditions into a single if some time ago.
>
>>
>>>
>>>> which component trashed the cache page which belonged to btrfs. But btrfs
>>>> only issued a warning, and as a result, the corrupted metadata block went 
>>>> to
>>>> disk.
>>>>
>>>> I think we should return an error in such case that the extent buffer
>>>> doesn't look alright.
>>>
>>> I think so too.
>>>
>>>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
>>>> will,
>>>> but it is better than to have a silent metadata corruption on disk.
>>>>
>>>> Note: this patch has been properly tested on 3.18 kernel only.
>>>>
>>>> Signed-off-by: Alex Lyakas 
>>>> ---
>>>> fs/btrfs/disk-io.c | 14 --
>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index 4545e2e..701e706 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>>>> {
>>>> u64 start = page_offset(page);
>>>> u64 found_start;
>>>> struct extent_buffer *eb;
>>>>
>>>> eb = (struct extent_buffer *)page->private;
>>>> if (page != eb->pages[0])
>>>> return 0;
>>>> found_start = btrfs_header_bytenr(eb);
>>>> if (WARN_ON(found_start != start || !PageUptodate(page)))
>>>> -return 0;
>>>> -csum_tree_block(fs_info, eb, 0);
>>>> +return -EUCLEAN;
>>>> +#ifdef CONFIG_BTRFS_ASSERT
>>>
>>> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do 
>>> assertions.
>>> I would remove this #ifdef ... #endif or do the memcmp calls inside 
>>> ASSERT().
>> Agreed.
>>
>>>
>>>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>>> +(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
>>>> +return -EUCLEAN;
>>>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>>> +(unsigned long)btrfs_header_chunk_tree_uuid(eb),
>>>> +BTRFS_FSID_SIZE)))
>>>
>>> This seco

[PATCH 1/2] btrfs: csum_tree_block: return proper errno value

2016-03-10 Thread Alex Lyakas
Signed-off-by: Alex Lyakas 
---
 fs/btrfs/disk-io.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e..4420ab2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -296,52 +296,52 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
unsigned long map_len;
int err;
u32 crc = ~(u32)0;
unsigned long inline_result;
 
len = buf->len - offset;
while (len > 0) {
err = map_private_extent_buffer(buf, offset, 32,
&kaddr, &map_start, &map_len);
if (err)
-   return 1;
+   return err;
cur_len = min(len, map_len - (offset - map_start));
crc = btrfs_csum_data(kaddr + offset - map_start,
  crc, cur_len);
len -= cur_len;
offset += cur_len;
}
if (csum_size > sizeof(inline_result)) {
result = kzalloc(csum_size, GFP_NOFS);
if (!result)
-   return 1;
+   return -ENOMEM;
} else {
result = (char *)&inline_result;
}
 
btrfs_csum_final(crc, result);
 
if (verify) {
if (memcmp_extent_buffer(buf, result, 0, csum_size)) {
u32 val;
u32 found = 0;
memcpy(&found, result, csum_size);
 
read_extent_buffer(buf, &val, 0, csum_size);
btrfs_warn_rl(fs_info,
"%s checksum verify failed on %llu wanted %X found %X level %d",
fs_info->sb->s_id, buf->start,
val, found, btrfs_header_level(buf));
if (result != (char *)&inline_result)
kfree(result);
-   return 1;
+   return -EUCLEAN;
}
} else {
write_extent_buffer(buf, result, 0, csum_size);
}
if (result != (char *)&inline_result)
kfree(result);
return 0;
 }
 
 /*
@@ -509,22 +509,21 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
u64 start = page_offset(page);
u64 found_start;
struct extent_buffer *eb;
 
eb = (struct extent_buffer *)page->private;
if (page != eb->pages[0])
return 0;
found_start = btrfs_header_bytenr(eb);
if (WARN_ON(found_start != start || !PageUptodate(page)))
return 0;
-   csum_tree_block(fs_info, eb, 0);
-   return 0;
+   return csum_tree_block(fs_info, eb, 0);
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb)
 {
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u8 fsid[BTRFS_UUID_SIZE];
int ret = 1;
 
read_extent_buffer(eb, fsid, btrfs_header_fsid(), BTRFS_FSID_SIZE);
@@ -653,24 +652,22 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
btrfs_err(root->fs_info, "bad tree block level %d",
          (int)btrfs_header_level(eb));
ret = -EIO;
goto err;
}
 
btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb),
                               eb, found_level);
 
ret = csum_tree_block(root->fs_info, eb, 1);
-   if (ret) {
-   ret = -EIO;
+   if (ret)
goto err;
-   }
 
/*
 * If this is a leaf block and it is corrupt, set the corrupt bit so
 * that we don't try and read the other copies of this block, just
 * return -EIO.
 */
if (found_level == 0 && check_leaf(root, eb)) {
set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
ret = -EIO;
}
-- 
1.9.1



[PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk

2016-03-10 Thread Alex Lyakas
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return an error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this (for example, flush_epd_write_bio
will), but that is better than having a silent metadata corruption on disk.

Signed-off-by: Alex Lyakas 
---
 fs/btrfs/disk-io.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4420ab2..cf85714 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -506,23 +506,34 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
u64 start = page_offset(page);
u64 found_start;
struct extent_buffer *eb;
 
eb = (struct extent_buffer *)page->private;
if (page != eb->pages[0])
return 0;
+
found_start = btrfs_header_bytenr(eb);
-   if (WARN_ON(found_start != start || !PageUptodate(page)))
-   return 0;
+   /*
+* Please do not consolidate these warnings into a single if.
+* It is useful to know what went wrong.
+*/
+   if (WARN_ON(found_start != start))
+   return -EUCLEAN;
+   if (WARN_ON(!PageUptodate(page)))
+   return -EUCLEAN;
+
+   ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+   btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
+
return csum_tree_block(fs_info, eb, 0);
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb)
 {
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u8 fsid[BTRFS_UUID_SIZE];
int ret = 1;
 
-- 
1.9.1



Re: [RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk

2016-02-22 Thread Alex Lyakas
Thank you, Filipe, for your review.

On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana  wrote:
> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas  wrote:
>> csum_dirty_buffer was issuing a warning in case the extent buffer
>> did not look alright, but was still returning success.
>> Let's return an error in this case, and also add two additional sanity
>> checks on the extent buffer header.
>>
>> We had btrfs metadata corruption, and after looking at the logs we saw
>> that WARN_ON(found_start != start) has been triggered. We are still
>> investigating
>
> There's a warning for WARN_ON(found_start != start || !PageUptodate(page))
>
> Are you sure it triggered only because of found_start != start and not
> because of !PageUptodate(page) (or both)?
The problem initially happened on kernel 3.8.13.  In this kernel, the
code looks like this:
 found_start = btrfs_header_bytenr(eb);
 if (found_start != start) {
 WARN_ON(1);
 return 0;
 }
 if (!PageUptodate(page)) {
 WARN_ON(1);
 return 0;
 }
(You can see it on
http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420)
The WARN_ON that we hit was on the found_start comparison.

>
>> which component trashed the cache page that belonged to btrfs. But btrfs
>> only issued a warning, and as a result, the corrupted metadata block went
>> to disk.
>>
>> I think we should return an error in such a case, where the extent buffer
>> doesn't look right.
>
> I think so too.
>
>> The caller up the chain may BUG_ON on this (for example, flush_epd_write_bio
>> will), but that is better than having a silent metadata corruption on disk.
>>
>> Note: this patch has been properly tested on 3.18 kernel only.
>>
>> Signed-off-by: Alex Lyakas 
>> ---
>> fs/btrfs/disk-io.c | 14 --
>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 4545e2e..701e706 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>> {
>> u64 start = page_offset(page);
>> u64 found_start;
>> struct extent_buffer *eb;
>>
>> eb = (struct extent_buffer *)page->private;
>> if (page != eb->pages[0])
>> return 0;
>> found_start = btrfs_header_bytenr(eb);
>> if (WARN_ON(found_start != start || !PageUptodate(page)))
>> -return 0;
>> -csum_tree_block(fs_info, eb, 0);
>> +return -EUCLEAN;
>> +#ifdef CONFIG_BTRFS_ASSERT
>
> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do 
> assertions.
> I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT().
Agreed.

>
>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>> +(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
>> +return -EUCLEAN;
>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>> +(unsigned long)btrfs_header_chunk_tree_uuid(eb),
>> +BTRFS_FSID_SIZE)))
>
> This second comparison doesn't seem correct. Second argument to
> memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which
> shouldn't be the same as the fsid (take a look at utils.c:make_btrfs()
> in the tools, both uuids are generated by different calls to
> uuid_generate()) - did you make your tests only before adding this
> comparison?. Also you should use BTRFS_UUID_SIZE instead of
> BTRFS_FSID_SIZE (even if both have the same value).
Obviously, you are right. In the 3.18-based code that I fixed locally
here, the fix looks like this:

if (found_start != start) {
        ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) != page->index<<PAGE_CACHE_SHIFT(%llu)",
                    root->fs_info->sb->s_id, found_start, start);
        return -EUCLEAN;
}
if (!PageUptodate(page)) {
        ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu) !PageUptodate",
                    root->fs_info->sb->s_id, start, (u64)page->index);
        return -EUCLEAN;
}
if (memcmp_extent_buffer(eb, root->fs_info->fsid,
                (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) {
        u8 hdr_fsid[BTRFS_FSID_SIZE] = {0};

        read_extent_buffer(eb, hdr_fsid,
                (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE);
        ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"] != fs_info->fsid["PRIX128"]",
                    root->fs_info->sb->s_id, start,
                    PRI_UUID(hdr_fsi

[RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk

2016-02-21 Thread Alex Lyakas

csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return an error in this case, and also add two additional sanity
checks on the extent buffer header.

We had btrfs metadata corruption, and after looking at the logs we saw
that WARN_ON(found_start != start) has been triggered. We are still
investigating which component trashed the cache page that belonged to
btrfs. But btrfs only issued a warning, and as a result, the corrupted
metadata block went to disk.

I think we should return an error in such a case, where the extent buffer
doesn't look right.
The caller up the chain may BUG_ON on this (for example, flush_epd_write_bio
will), but that is better than having a silent metadata corruption on disk.

Note: this patch has been properly tested on 3.18 kernel only.

Signed-off-by: Alex Lyakas 
---
fs/btrfs/disk-io.c | 14 --
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e..701e706 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)

{
u64 start = page_offset(page);
u64 found_start;
struct extent_buffer *eb;

eb = (struct extent_buffer *)page->private;
if (page != eb->pages[0])
return 0;
found_start = btrfs_header_bytenr(eb);
if (WARN_ON(found_start != start || !PageUptodate(page)))
-return 0;
-csum_tree_block(fs_info, eb, 0);
+return -EUCLEAN;
+#ifdef CONFIG_BTRFS_ASSERT
+if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
+return -EUCLEAN;
+if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+(unsigned long)btrfs_header_chunk_tree_uuid(eb),
+BTRFS_FSID_SIZE)))
+return -EUCLEAN;
+#endif
+if (csum_tree_block(fs_info, eb, 0))
+return -EUCLEAN;
return 0;
}

static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb)
{
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u8 fsid[BTRFS_UUID_SIZE];
int ret = 1;

--
1.9.1



Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Thank you, Filipe. Now it is more clear.
Fortunately, in my 3.18 kernel I do not have do_chunk_alloc() calling
btrfs_create_pending_block_groups(), so I cannot hit this deadlock.
But I can still hit the issue that this call is meant to fix.

Thanks,
Alex.


On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana  wrote:
> On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas  wrote:
>> Hi Filipe Manana,
>>
>> Can't the call to btrfs_create_pending_block_groups() cause a
>> deadlock, like in
>> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
>> call updates the device tree, and we may be calling do_chunk_alloc()
>> from find_free_extent() when holding a lock on the device tree root
>> (because we want to COW a block of the device tree).
>>
>> My understanding from Josef's chunk allocator rework
>> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
>> when allocating a new chunk we do not immediately update the
>> device/chunk tree. We keep the new chunk in "pending_chunks" and in
>> "new_bgs" on a transaction handle, and we actually update the
>> chunk/device tree only when we are done with a particular transaction
>> handle. This way we avoid that sort of deadlocks.
>>
>> But this patch breaks this rule, as it may make us update the
>> device/chunk tree in the context of chunk allocation, which is the
>> scenario that the rework was meant to avoid.
>>
>> Can you please point me at what I am missing?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff
>
>>
>> Thanks,
>> Alex.
>>
>>
>> On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval  wrote:
>>> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>>>> From: Filipe Manana 
>>>>
>>>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>>>> finishing block group creation"), introduced in 4.2-rc1, the following
>>>> test was failing due to exhaustion of the system array in the superblock:
>>>>
>>>>   #!/bin/bash
>>>>
>>>>   truncate -s 100T big.img
>>>>   mkfs.btrfs big.img
>>>>   mount -o loop big.img /mnt/loop
>>>>
>>>>   num=5
>>>>   sz=10T
>>>>   for ((i = 0; i < $num; i++)); do
>>>>   echo fallocate $i $sz
>>>>   fallocate -l $sz /mnt/loop/testfile$i
>>>>   done
>>>>   btrfs filesystem sync /mnt/loop
>>>>
>>>>   for ((i = 0; i < $num; i++)); do
>>>> echo rm $i
>>>> rm /mnt/loop/testfile$i
>>>> btrfs filesystem sync /mnt/loop
>>>>   done
>>>>   umount /mnt/loop
>>>>
>>>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>>>> allocation of system block groups. This happened because the test creates
>>>> a large number of data block groups per transaction and when committing
>>>> the transaction we start the writeout of the block group caches for all
>>>> the new (dirty) block groups, which results in pre-allocating space
>>>> for each block group's free space cache using the same transaction handle.
>>>> That in turn often leads to creation of more block groups, and all get
>>>> attached to the new_bgs list of the same transaction handle to the point
>>>> of getting a list with over 1500 elements, and creation of new block groups
>>>> leads to the need of reserving space in the chunk block reserve and often
>>>> creating a new system block group too.
>>>>
>>>> So that made us quickly exhaust the chunk block reserve/system space info,
>>>> because as of the commit mentioned before, we do reserve space for each
>>>> new block group in the chunk block reserve, unlike before where we would
>>>> not and would at most allocate one new system block group and therefore
>>>> would only ensure that there was enough space in the system space info to
>>>> allocate 1 new block group even if we ended up allocating thousands of
>>>> new block groups using the same transaction handle. That worked most of
>>>> the time because the computed required space at check_system_chunk() is
>>>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>>>> that all nodes/leafs in a path will be COWed and split) and since the
>>>> updates to the chunk tree all happen at btrfs_cr

Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references

2015-12-13 Thread Alex Lyakas
Hi Filipe,

Thank you for the explanation.

On Sun, Dec 13, 2015 at 5:43 PM, Filipe Manana  wrote:
> On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas  wrote:
>> Hi Filipe Manana,
>>
>> My understanding of selecting delayed refs to run or merging them is
>> far from complete. Can you please explain what will happen in the
>> following scenario:
>>
>> 1) Ref1 is created, as you explain
>> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
>> up with an EXTENT_ITEM and an inline extent back ref
>> 3) Ref2 and Ref3 are added
>> 4) Somebody calls __btrfs_run_delayed_refs()
>>
>> At this point, we cannot merge Ref2 and Ref3, because they might be
>> referencing tree blocks of completely different trees, thus
>> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
>> run, because we prefer BTRFS_ADD_DELAYED_REF over
>> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
>> now, because we already have Ref1 in the extent tree.
>
> No, that won't happen. If the ref (Ref3) is for a different tree, than
> it has a different inline extent from Ref1
> (lookup_inline_extent_backref returns -ENOENT and not 0).
Understood. So in this case, we will first add an inline ref for Ref3,
and later drop the Ref1 inline ref via update_inline_extent_backref()
by truncating the EXTENT_ITEM. All in the same transaction.


>
> If they are all for the same tree it means Ref3 is not merged with
> Ref2 because they have different seq numbers and a seq value exist in
> fs_info->tree_mod_seq_list, and we skip Ref3 through
> btrfs_check_delayed_seq() until such seq number goes away from
> tree_mod_seq_list.
Ok, so we won't process this ref-head at all, until the "seq problem"
disappears.

> If no seq number exists in tree_mod_seq_list then
> we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
> running delayed refs, with Ref2 (which removes both refs since one is
> "-1" and the other "+1").
So in this case we don't care that the inline ref we have in the
EXTENT_ITEM was actually inserted on behalf of Ref1, because it's for
the same EXTENT_ITEM and for the same root. So Ref3 and Ref1 are fully
equivalent. Interesting.
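
To summarize my understanding in simplified pseudo-C (not actual kernel
code; names as in the discussion above):

/* running an ADD ref (Ref3) for an extent that already carries
 * Ref1's inline backref: */
if (ref3_root != ref1_root) {
        /* different tree: lookup_inline_extent_backref() returns
         * -ENOENT, a new inline ref is inserted, no BUG_ON */
} else if (btrfs_check_delayed_seq(fs_info, delayed_refs, seq)) {
        /* same tree, but a seq number blocks us: skip the whole
         * ref-head until the seq leaves tree_mod_seq_list */
} else {
        /* same tree, no seq: btrfs_merge_delayed_refs() cancels
         * Ref2 (-1) against Ref3 (+1) before either is run */
}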

Thanks!
Alex.

>
> Iow, after this regression fix, no behaviour changed from releases before 4.2.
>
>>
>> So something should prevent us from running Ref3 before running Ref2.
>> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
>> the inline backref, and then run Ref3 to create a new backref with a
>> proper owner. What is that something?
>>
>> Can you please point me at what I am missing?
>>
>> Also, can such a scenario happen in the 3.18 kernel, which still has an
>> rbtree per ref-head? Looking at the code, I don't see anything
>> preventing that from happening.
>>
>> Thanks,
>> Alex.
>>
>>
>> On Sun, Oct 25, 2015 at 8:51 PM,   wrote:
>>> From: Filipe Manana 
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>> [35908.065201] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
>>> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache 
>>> sunrpc loop fuse parport_pc psmouse i2
>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW   
>>> 4.3.0-rc5-btrfs-next-17+ #1
>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>>> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 
>>> 88010c4c8000
>>> [35908.065201] RIP: 0010:[]  [] 
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP: 0018:88010c4cbb08  EFLAGS: 00010293
>>> [35908.065201] RAX:  RBX: 88008a661000 RCX: 
>>> 
>>> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: 
>>> 
>>> [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: 
>>> 88010c4cb9f8
>>> [35908.065201] R10: 0

Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references

2015-12-13 Thread Alex Lyakas
Hi Filipe Manana,

My understanding of selecting delayed refs to run or merging them is
far from complete. Can you please explain what will happen in the
following scenario:

1) Ref1 is created, as you explain
2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
up with an EXTENT_ITEM and an inline extent back ref
3) Ref2 and Ref3 are added
4) Somebody calls __btrfs_run_delayed_refs()

At this point, we cannot merge Ref2 and Ref3, because they might be
referencing tree blocks of completely different trees, thus
comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
run, because we prefer BTRFS_ADD_DELAYED_REF over
BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
now, because we already have Ref1 in the extent tree.

So something should prevent us from running Ref3 before running Ref2.
We should run Ref2 first, which should get rid of the EXTENT_ITEM and
the inline backref, and then run Ref3 to create a new backref with a
proper owner. What is that something?

Can you please point me at what I am missing?

Also, can such a scenario happen in the 3.18 kernel, which still has an
rbtree per ref-head? Looking at the code, I don't see anything
preventing that from happening.

Thanks,
Alex.


On Sun, Oct 25, 2015 at 8:51 PM,   wrote:
> From: Filipe Manana 
>
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
>
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc 
> loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW 
>   4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 
> 88010c4c8000
> [35908.065201] RIP: 0010:[]  [] 
> insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:88010c4cbb08  EFLAGS: 00010293
> [35908.065201] RAX:  RBX: 88008a661000 RCX: 
> 
> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: 
> 
> [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: 
> 88010c4cb9f8
> [35908.065201] R10:  R11: 002c R12: 
> 
> [35908.065201] R13: 88020a74c578 R14:  R15: 
> 
> [35908.065201] FS:  () GS:88023edc() 
> knlGS:
> [35908.065201] CS:  0010 DS:  ES:  CR0: 8005003b
> [35908.065201] CR2: 015e8708 CR3: 000102185000 CR4: 
> 06e0
> [35908.065201] Stack:
> [35908.065201]  88010c4cbb18 0f37 88020a74c578 
> 88015a408000
> [35908.065201]  880154a44000  0005 
> 88010c4cbbd8
> [35908.065201]  a0492b9a 0005  
> 
> [35908.065201] Call Trace:
> [35908.065201]  [] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201]  [] ? __btrfs_run_delayed_refs+0x4d4/0xd33 
> [btrfs]
> [35908.065201]  [] __btrfs_run_delayed_refs+0xafa/0xd33 
> [btrfs]
> [35908.065201]  [] ? join_transaction.isra.10+0x25/0x41f 
> [btrfs]
> [35908.065201]  [] ? join_transaction.isra.10+0xa8/0x41f 
> [btrfs]
> [35908.065201]  [] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201]  [] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201]  [] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201]  [] btrfs_extent_refs_helper+0x12/0x14 
> [btrfs]
> [35908.065201]  [] process_one_work+0x24a/0x4ac
> [35908.065201]  [] worker_thread+0x206/0x2c2
> [35908.065201]  [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [] kthread+0xef/0xf7
> [35908.065201]  [] ? kthread_parkme+0x24/0x24
> [35908.065201]  [] ret_from_fork+0x3f/0x70
> [35908.065201]  [] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 
> 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 
> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP  [] 
> insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201]  RSP 
> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> This happens because the new delayed references code no longer merges
> delayed references that have different sequence values. The following
> steps are an example seq

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Hi Filipe Manana,

Can't the call to btrfs_create_pending_block_groups() cause a
deadlock, like in
http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
call updates the device tree, and we may be calling do_chunk_alloc()
from find_free_extent() when holding a lock on the device tree root
(because we want to COW a block of the device tree).

My understanding from Josef's chunk allocator rework
(http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
when allocating a new chunk we do not immediately update the
device/chunk tree. We keep the new chunk in "pending_chunks" and in
"new_bgs" on a transaction handle, and we actually update the
chunk/device tree only when we are done with a particular transaction
handle. This way we avoid that sort of deadlocks.

But this patch breaks this rule, as it may make us update the
device/chunk tree in the context of chunk allocation, which is the
scenario that the rework was meant to avoid.
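
Schematically, my understanding of the rework (rough sketch based on the
description above, not exact code):

do_chunk_alloc()
    -> btrfs_alloc_chunk()      /* chunk exists only in memory; the new
                                   block group is attached to the
                                   transaction handle's new_bgs list */
...
__btrfs_end_transaction()
    -> btrfs_create_pending_block_groups()
                                /* only here are the chunk and device
                                   trees actually updated */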

Can you please point me at what I am missing?

Thanks,
Alex.


On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval  wrote:
> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>> finishing block group creation"), introduced in 4.2-rc1, the following
>> test was failing due to exhaustion of the system array in the superblock:
>>
>>   #!/bin/bash
>>
>>   truncate -s 100T big.img
>>   mkfs.btrfs big.img
>>   mount -o loop big.img /mnt/loop
>>
>>   num=5
>>   sz=10T
>>   for ((i = 0; i < $num; i++)); do
>>   echo fallocate $i $sz
>>   fallocate -l $sz /mnt/loop/testfile$i
>>   done
>>   btrfs filesystem sync /mnt/loop
>>
>>   for ((i = 0; i < $num; i++)); do
>> echo rm $i
>> rm /mnt/loop/testfile$i
>> btrfs filesystem sync /mnt/loop
>>   done
>>   umount /mnt/loop
>>
>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>> allocation of system block groups. This happened because the test creates
>> a large number of data block groups per transaction and when committing
>> the transaction we start the writeout of the block group caches for all
>> the new (dirty) block groups, which results in pre-allocating space
>> for each block group's free space cache using the same transaction handle.
>> That in turn often leads to creation of more block groups, and all get
>> attached to the new_bgs list of the same transaction handle to the point
>> of getting a list with over 1500 elements, and creation of new block groups
>> leads to the need of reserving space in the chunk block reserve and often
>> creating a new system block group too.
>>
>> So that made us quickly exhaust the chunk block reserve/system space info,
>> because as of the commit mentioned before, we do reserve space for each
>> new block group in the chunk block reserve, unlike before where we would
>> not and would at most allocate one new system block group and therefore
>> would only ensure that there was enough space in the system space info to
>> allocate 1 new block group even if we ended up allocating thousands of
>> new block groups using the same transaction handle. That worked most of
>> the time because the computed required space at check_system_chunk() is
>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>> that all nodes/leafs in a path will be COWed and split) and since the
>> updates to the chunk tree all happen at btrfs_create_pending_block_groups
>> it is unlikely that a path needs to be COWed more than once (unless
>> writepages() for the btree inode is called by mm in between) and that
>> compensated for the need of creating any new nodes/leads in the chunk
>> tree.
>>
>> So fix this by ensuring we don't accumulate a too large list of new block
>> groups in a transaction's handles new_bgs list, inserting/updating the
>> chunk tree for all accumulated new block groups and releasing the unused
>> space from the chunk block reserve whenever the list becomes sufficiently
>> large. This is a generic solution even though the problem currently can
>> only happen when starting the writeout of the free space caches for all
>> dirty block groups (btrfs_start_dirty_block_groups()).
>>
>> Reported-by: Omar Sandoval 
>> Signed-off-by: Filipe Manana 
>
> Thanks a lot for taking a look.
>
> Tested-by: Omar Sandoval 
>
>> ---
>>  fs/btrfs/extent-tree.c | 18 ++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 171312d..07204bf 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4227,6 +4227,24 @@ out:
>>   space_info->chunk_alloc = 0;
>>   spin_unlock(&space_info->lock);
>>   mutex_unlock(&fs_info->chunk_mutex);
>> + /*
>> +  * When we allocate a new chunk we reserve space in the chunk block
>> +  * reserve to make sure we can COW nodes/leafs in the chunk tree or
>> +  * add new n

Re: [PATCH V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state

2015-12-13 Thread Alex Lyakas
[Resending in plain text, apologies.]

Hi Chandan, Josef, Chris,

I am not sure I understand the fix to the problem.

It may happen that when updating the device tree, we need to allocate a new
chunk via do_chunk_alloc() (while we are holding the device tree root node
locked). This is a legitimate thing for find_free_extent() to do. And
the do_chunk_alloc() call may lead to a call to
btrfs_create_pending_block_groups(), which will try to update the device
tree. This may happen due to the direct call to
btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
perhaps via the __btrfs_end_transaction() that find_free_extent() does after
it has completed chunk allocation (although in this case it will use the
transaction that already exists in current->journal_info).
So the deadlock may still happen?
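
The cycle I am worried about, schematically:

find_free_extent()      /* may hold the device tree root node locked,
                           when we are COWing a device tree block */
    -> do_chunk_alloc()
        -> btrfs_create_pending_block_groups()
            -> update of the device tree
               /* needs the very lock we are already holding */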

Thanks,
 Alex.

>
>
> On Mon, Nov 2, 2015 at 6:52 PM, Chris Mason  wrote:
>>
>> On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
>> > When executing generic/001 in a loop on a ppc64 machine (with both
>> > sectorsize
>> > and nodesize set to 64k), the following call trace is observed,
>>
>> Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.
>>
>> -chris


Re: [RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly

2015-12-06 Thread Alex Lyakas
do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depending on how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
return 0;
}

return ret;

So it will return -ENOSPC.

Signed-off-by: Alex Lyakas 
Reviewed-by: Josef Bacik 

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b89680..1ba3f0d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
 btrfs_get_alloc_profile(root, 0),
 CHUNK_ALLOC_NO_FORCE);
btrfs_end_transaction(trans, root);
-   if (ret == -ENOSPC)
+   if (ret > 0 || ret == -ENOSPC)
ret = 0;
break;
case COMMIT_TRANS:

On Sun, Dec 6, 2015 at 12:19 PM, Alex Lyakas  wrote:
> Hi Liu,
> I was studying how block reservation works, and making some
> modifications in reserve_metadata_bytes to better understand what it
> does. Then suddenly I saw this problem. I guess it depends on which
> value of the "flush" parameter is passed to reserve_metadata_bytes.
>
> Alex.
>
>
> On Thu, Dec 3, 2015 at 8:14 PM, Liu Bo  wrote:
>> On Thu, Dec 03, 2015 at 06:51:03PM +0200, Alex Lyakas wrote:
>>> do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
>>> But flush_space will not convert this to 0, and will also return 1.
>>> As a result, reserve_metadata_bytes will think that flush_space failed,
>>> and may potentially return this value "1" to the caller (depending on how
>>> reserve_metadata_bytes was called). The caller will also treat this as an
>>> error.
>>> For example, btrfs_block_rsv_refill does:
>>>
>>> int ret = -ENOSPC;
>>> ...
>>> ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
>>> if (!ret) {
>>> block_rsv_add_bytes(block_rsv, num_bytes, 0);
>>> return 0;
>>> }
>>>
>>> return ret;
>>>
>>> So it will return -ENOSPC.
>>
>> It will return 1 instead of -ENOSPC.
>>
>> The patch looks good, I noticed this before, but I didn't manage to trigger
>> an error for this; did you catch an error like that?
>>
>> Thanks,
>>
>> -liubo
>>
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 4b89680..1ba3f0d 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
>>>  btrfs_get_alloc_profile(root, 0),
>>>  CHUNK_ALLOC_NO_FORCE);
>>> btrfs_end_transaction(trans, root);
>>> -   if (ret == -ENOSPC)
>>> +   if (ret > 0 || ret == -ENOSPC)
>>> ret = 0;
>>> break;
>>> case COMMIT_TRANS:


Re: [RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly

2015-12-06 Thread Alex Lyakas
Hi Liu,
I was studying how block reservation works, and making some
modifications in reserve_metadata_bytes to better understand what it
does. Then suddenly I saw this problem. I guess it depends on which
value of the "flush" parameter is passed to reserve_metadata_bytes.

Alex.


On Thu, Dec 3, 2015 at 8:14 PM, Liu Bo  wrote:
> On Thu, Dec 03, 2015 at 06:51:03PM +0200, Alex Lyakas wrote:
>> do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
>> But flush_space will not convert this to 0, and will also return 1.
>> As a result, reserve_metadata_bytes will think that flush_space failed,
>> and may potentially return this value "1" to the caller (depending on how
>> reserve_metadata_bytes was called). The caller will also treat this as an
>> error.
>> For example, btrfs_block_rsv_refill does:
>>
>> int ret = -ENOSPC;
>> ...
>> ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
>> if (!ret) {
>> block_rsv_add_bytes(block_rsv, num_bytes, 0);
>> return 0;
>> }
>>
>> return ret;
>>
>> So it will return -ENOSPC.
>
> It will return 1 instead of -ENOSPC.
>
> The patch looks good, I noticed this before, but I didn't manage to trigger
> an error for this, did you catch an error like that?
>
> Thanks,
>
> -liubo
>
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 4b89680..1ba3f0d 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
>>  btrfs_get_alloc_profile(root, 0),
>>  CHUNK_ALLOC_NO_FORCE);
>> btrfs_end_transaction(trans, root);
>> -   if (ret == -ENOSPC)
>> +   if (ret > 0 || ret == -ENOSPC)
>> ret = 0;
>> break;
>> case COMMIT_TRANS:


[RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly

2015-12-03 Thread Alex Lyakas
do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depending on how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
return 0;
}

return ret;

So it will return -ENOSPC.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b89680..1ba3f0d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
 btrfs_get_alloc_profile(root, 0),
 CHUNK_ALLOC_NO_FORCE);
btrfs_end_transaction(trans, root);
-   if (ret == -ENOSPC)
+   if (ret > 0 || ret == -ENOSPC)
ret = 0;
break;
case COMMIT_TRANS:


Re: [PATCH] btrfs: clear bio reference after submit_one_bio()

2015-11-07 Thread Alex Lyakas
Hi Holger,
I think it will cause an invalid paging request, just like in the case
that Naohiro has fixed.
I am not running the "latest and greatest" btrfs on my system, and it
is not easy to set it up; that's why I cannot submit patches based on
the latest code, and can only review and comment on patches.

Alex.


On Thu, Nov 5, 2015 at 3:08 PM, Holger Hoffstätte
 wrote:
> On 10/11/15 20:09, Alex Lyakas wrote:
>> Hi Naota,
>>
>> What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we
>> return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And
>> if *bio_ret was non-NULL upon entry into submit_extent_page, then we
>> had submitted this bio before getting to btrfs_bio_alloc(). So should
>> btrfs_bio_alloc() failure be handled in the same way?
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 3915c94..cd443bc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
>>
>> bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
>> GFP_NOFS | __GFP_HIGH);
>> -   if (!bio)
>> +   if (!bio) {
>> +   if (bio_ret)
>> +   *bio_ret = NULL;
>> return -ENOMEM;
>> +   }
>>
>> bio_add_page(bio, page, page_size, offset);
>> bio->bi_end_io = end_io_func;
>>
>
> Did you get any feedback on this? It seems it could cause data loss or
> corruption on allocation failures, no?
>
> -h
>


Re: [PATCH] Btrfs: throttle delayed refs better

2015-10-14 Thread Alex Lyakas
Hi Josef,
Looking at the latest Linus tree, I still see:

if (actual_count > 0) {
        u64 runtime = ktime_to_ns(ktime_sub(ktime_get(), start));
        ...
        avg = fs_info->avg_delayed_ref_runtime * 3 + runtime;
        avg = div64_u64(avg, 4);

So we need to divide "runtime" by "actual_count" before accounting it
in "avg_delayed_ref_runtime"?

Thanks,
Alex.


On Thu, Feb 27, 2014 at 5:56 PM, Josef Bacik  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 02/27/2014 10:38 AM, 钱凯 wrote:
>> I'm a little confused about what "avg_delayed_ref_runtime" means.
>>
>> In __btrfs_run_delayed_refs(), "avg_delayed_ref_runtime" is set to
>> the runtime of all delayed refs processed in the current transaction
>> commit. However, in btrfs_should_throttle_delayed_refs(), we base
>> the decision whether to throttle refs or not on the following condition:
>>
>>     avg_runtime = fs_info->avg_delayed_ref_runtime;
>>     if (num_entries * avg_runtime >= NSEC_PER_SEC)
>>             return 1;
>>
>> It looks like "avg_delayed_ref_runtime" is used here as the average
>> runtime of each processed delayed ref. So what does it really mean?
>>
>
> Yeah I screwed this up, I should have been dividing the total time by
> the number of delayed refs I ran.  I have a patch locally to fix it
> and I'll send it out after I finish my qgroup work.  Thanks,
>
> Josef
>
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQIcBAEBAgAGBQJTD2AlAAoJEANb+wAKly3BQkEP/0F/LGGDsO+x63SAFh/apRZo
> ZVmzi1yJGiArFImFs8IwZHKgr/HpP9yYYFqyDCTSYrErI32bjpPbSDKlFDiIKYBq
> 6mTptPlC6AJQcMJf3oV2SqUoQxI6Ea+04QaTtZwE5pDaTZsjD47QYfSyw/i+YwOr
> Ds11ayDeU3FSj8JVYDKFg5ZBifv/mIHbh1fb8xc4R5XCWsbRzIL9LiQa9c56EEOq
> vzXp57TIetbJdliK0cYQtPkA7R40us8TqVBH5MfcZPgITyBun3e0zrGxWmW6caTs
> viejEbqDhyHLHCing+mMI6GX7w16duq5oG+w4nnjjyuMzWAyNN2pxloqQsWwOyv8
> 7+33JZCtVG/txRMIXkvc3bqzetrUyPAruo+M3pstN7B2dph6TDV0QJSFnxee6mKf
> 4/zseNOJtQqjHe5QJNcVJtkDaxgGBkSONHLm5Gz8rFU3XKcNZQcocV+0EtIjE7Zs
> D5oDYCAyrxG1VKoFWhdaS883PDokRr75jcnFui4GhhFr5OAOdS3OOTLKVizWUag1
> O11d9XsjnzLWiVTsZH+f4K0ONQcUwJFV0zADgYsXtU2LDHHNIPZX9+qSAa+L66hT
> Ki6hocoZ4cXyGWcTZPtlGHxAmV2kEh8/Tr1ePfwy7FzTrg9hWUGLXY0DliQDPmIB
> w3TdOa+Ghjl8dcaGc2rX
> =kSsY
> -END PGP SIGNATURE-


Re: [PATCH] btrfs: clear bio reference after submit_one_bio()

2015-10-11 Thread Alex Lyakas
Hi Naota,

What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we
return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And
if *bio_ret was non-NULL upon entry into submit_extent_page, then we
had submitted this bio before getting to btrfs_bio_alloc(). So should
btrfs_bio_alloc() failure be handled in the same way?

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3915c94..cd443bc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,

bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
GFP_NOFS | __GFP_HIGH);
-   if (!bio)
+   if (!bio) {
+   if (bio_ret)
+   *bio_ret = NULL;
return -ENOMEM;
+   }

bio_add_page(bio, page, page_size, offset);
bio->bi_end_io = end_io_func;


Thanks,
Alex.

On Wed, Jan 7, 2015 at 12:46 AM, Satoru Takeuchi
 wrote:
> Hi Naota,
>
> On 2015/01/06 1:01, Naohiro Aota wrote:
>> After submit_one_bio(), `bio' can go away. However submit_extent_page()
>> leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
>> It will cause invalid paging request when submit_extent_page() is called
>> next time.
>>
>> I reproduced ENOMEM case with the following script (need
>> CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).
>
> I confirmed that this problem reproduce with 3.19-rc3 and
> not reproduce with 3.19-rc3 with your patch.
>
> Tested-by: Satoru Takeuchi 
>
> Thank you for reporting this problem with the reproducer
> and fixing it too.
>
>   NOTE:
>   I used v3.19-rc3's tools/testing/fault-injection/failcmd.sh
>   for the following "./failcmd.sh".
>
>   >./failcmd.sh -p $percent -t $times -i $interval \
>   >--ignore-gfp-highmem=N --ignore-gfp-wait=N 
> --min-order=0 \
>   >-- \
>   >cat $directory/file > /dev/null
>
> * 3.19-rc1 + your patch
>
> ===
> # ./run
> 512+0 records in
> 512+0 records out
> #
> ===
>
> * 3.19-rc3
>
> ===
> # ./run
> 512+0 records in
> 512+0 records out
> [  188.433726] run (776): drop_caches: 1
> [  188.682372] FAULT_INJECTION: forcing a failure.
> name fail_page_alloc, interval 100, probability 111000, space 0, times 3
> [  188.689986] CPU: 0 PID: 954 Comm: cat Not tainted 3.19.0-rc3-ktest #1
> [  188.693834] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Bochs 01/01/2011
> [  188.698466]  0064 88007b343618 816e5563 
> 88007fc0fc78
> [  188.702730]  81c655c0 88007b343638 813851b5 
> 0010
> [  188.707043]  0002 88007b343768 81188126 
> 88007b3435a8
> [  188.711283] Call Trace:
> [  188.712620]  [] dump_stack+0x45/0x57
> [  188.715330]  [] should_fail+0x135/0x140
> [  188.718218]  [] __alloc_pages_nodemask+0xd6/0xb30
> [  188.721567]  [] ? blk_rq_map_sg+0x35/0x170
> [  188.724558]  [] ? virtio_queue_rq+0x145/0x2b0 
> [virtio_blk]
> [  188.728191]  [] ? 
> btrfs_submit_compressed_read+0xcf/0x4d0 [btrfs]
> [  188.732079]  [] ? kmem_cache_alloc+0x1cb/0x230
> [  188.735153]  [] ? mempool_alloc_slab+0x15/0x20
> [  188.738188]  [] alloc_pages_current+0x9a/0x120
> [  188.741153]  [] btrfs_submit_compressed_read+0x1a9/0x4d0 
> [btrfs]
> [  188.744835]  [] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
> [  188.748225]  [] ? lookup_extent_mapping+0x13/0x20 [btrfs]
> [  188.751547]  [] ? btrfs_get_extent+0x98/0xad0 [btrfs]
> [  188.754656]  [] submit_one_bio+0x67/0xa0 [btrfs]
> [  188.757554]  [] submit_extent_page.isra.35+0xd7/0x1c0 
> [btrfs]
> [  188.760981]  [] __do_readpage+0x31d/0x7b0 [btrfs]
> [  188.763920]  [] ? btrfs_create_repair_bio+0x110/0x110 
> [btrfs]
> [  188.767382]  [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
> [  188.770671]  [] ? btrfs_lookup_ordered_range+0x13d/0x180 
> [btrfs]
> [  188.774366]  [] 
> __extent_readpages.constprop.42+0x2ba/0x2d0 [btrfs]
> [  188.778031]  [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
> [  188.781241]  [] extent_readpages+0x169/0x1b0 [btrfs]
> [  188.784322]  [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
> [  188.789014]  [] btrfs_readpages+0x1f/0x30 [btrfs]
> [  188.792028]  [] __do_page_cache_readahead+0x18c/0x1f0
> [  188.795078]  [] ondemand_readahead+0xdf/0x260
> [  188.797702]  [] ? btrfs_congested_fn+0x5f/0xa0 [btrfs]
> [  188.800718]  [] page_cache_async_readahead+0x71/0xa0
> [  188.803650]  [] generic_file_read_iter+0x40f/0x5e0
> [  188.806480]  [] new_sync_read+0x7e/0xb0
> [  188.808832]  [] __vfs_read+0x18/0x50
> [  188.811068]  [] vfs_read+0x8a/0x140
> [  188.813298]  [] SyS_read+0x46/0xb0
> [  188.815486]  [] ? __audit_syscall_exit+0x1f6/0x2a0
> [  188

Re: [PATCH] Btrfs: check pending chunks when shrinking fs to avoid corruption

2015-09-30 Thread Alex Lyakas
Hi Filipe,

Looking at the code of this patch, I see that if we discover a pending
chunk, we unlock the chunk mutex, commit the transaction (which
completes the allocation of all pending chunks and inserts relevant
items into the device tree and chunk tree), and retry the search.

However, after we unlock the chunk mutex, somebody could have
attempted a new chunk allocation, which would have resulted in a new
pending chunk. On the other hand, we have done:

btrfs_device_set_total_bytes(device, new_size);

so this line should prevent anybody from allocating beyond the new size.
In that case, we are sure that on the second pass there will be no
pending chunks beyond the new size, so we can shrink to new_size
safely. Is my understanding correct?
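
If I read it correctly, the flow reduces to something like this
(condensed from the patch quoted below, same names; the retry via
"goto again" is my reading of it):

btrfs_device_set_total_bytes(device, new_size); /* fences new allocations */
...
if (!checked_pending_chunks &&
    contains_pending_extent(trans, device, &start, len)) {
        unlock_chunks(root);
        checked_pending_chunks = true;
        ret = btrfs_commit_transaction(trans, root);
        /* the commit flushes the pending chunks into the device tree */
        if (!ret)
                goto again; /* second pass cannot see pending chunks
                               beyond new_size */
}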

Thanks,
Alex.



On Tue, Jun 2, 2015 at 3:43 PM,   wrote:
> From: Filipe Manana 
>
> When we shrink the usable size of a device (its total_bytes), we go over
> all the device extent items in the device tree and attempt to relocate
> the chunk of any device extent that goes beyond the new usable size for
> the device. We do that after setting the new usable size (total_bytes) in
> the device object, so that all new allocations (and reallocations) don't
> use areas of the device that go beyond the new (shorter) size. However we
> were not considering that before setting the new size in the device,
> pending chunks might have been created that use device extents that go
> beyond the new size, and those device extents are not yet in the device
> tree after we search the device tree - they are still attached to the
> list of new block group for some ongoing transaction handle, and they are
> only added to the device tree when the transaction handle is ended (via
> btrfs_create_pending_block_groups()).
>
> So check for pending chunks with device extents that go beyond the new
> size and if any exists, commit the current transaction and repeat the
> search in the device tree.
>
> Not doing this it would mean we would return success to user space while
> still having extents that go beyond the new size, and later user space
> could override those locations on the device while the fs still references
> them, causing all sorts of corruption and unexpected events.
>
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/volumes.c | 49 -
>  1 file changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dbea12e..09e89a6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3984,6 +3984,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
> int slot;
> int failed = 0;
> bool retried = false;
> +   bool checked_pending_chunks = false;
> struct extent_buffer *l;
> struct btrfs_key key;
> struct btrfs_super_block *super_copy = root->fs_info->super_copy;
> @@ -4064,15 +4065,6 @@ again:
> goto again;
> } else if (failed && retried) {
> ret = -ENOSPC;
> -   lock_chunks(root);
> -
> -   btrfs_device_set_total_bytes(device, old_size);
> -   if (device->writeable)
> -   device->fs_devices->total_rw_bytes += diff;
> -   spin_lock(&root->fs_info->free_chunk_lock);
> -   root->fs_info->free_chunk_space += diff;
> -   spin_unlock(&root->fs_info->free_chunk_lock);
> -   unlock_chunks(root);
> goto done;
> }
>
> @@ -4084,6 +4076,35 @@ again:
> }
>
> lock_chunks(root);
> +
> +   /*
> +* We checked in the above loop all device extents that were already 
> in
> +* the device tree. However before we have updated the device's
> +* total_bytes to the new size, we might have had chunk allocations 
> that
> +* have not complete yet (new block groups attached to transaction
> +* handles), and therefore their device extents were not yet in the
> +* device tree and we missed them in the loop above. So if we have any
> +* pending chunk using a device extent that overlaps the device range
> +* that we can not use anymore, commit the current transaction and
> +* repeat the search on the device tree - this way we guarantee we 
> will
> +* not have chunks using device extents that end beyond 'new_size'.
> +*/
> +   if (!checked_pending_chunks) {
> +   u64 start = new_size;
> +   u64 len = old_size - new_size;
> +
> +   if (contains_pending_extent(trans, device, &start, len)) {
> +   unlock_chunks(root);
> +   checked_pending_chunks = true;
> +   failed = 0;
> +   retried = false;
> +   ret = btrfs_commit_transaction(trans, root);
> +   if (ret)
> +   goto done;
> +  

Re: [PATCH v5 04/18] btrfs: Add threshold workqueue based on kernel workqueue

2015-08-19 Thread Alex Lyakas
Hi Qu,


On Fri, Feb 28, 2014 at 4:46 AM, Qu Wenruo  wrote:
> The original btrfs_workers has thresholding functions to dynamically
> create or destroy kthreads.
>
> Though there is no such function in the kernel workqueue, because the
> workers are not created manually, we can still use workqueue_set_max_active
> to simulate the behavior, mainly to achieve better HDD performance by
> setting a high threshold on submit_workers.
> (Sadly, no resource can be saved)
>
> So in this patch, extra workqueue pending counters are introduced to
> dynamically change the max active of each btrfs_workqueue_struct, hoping
> to restore the behavior of the original thresholding function.
>
> Also, workqueue_set_max_active uses a mutex to protect workqueue_struct,
> and is not meant to be called too frequently, so a new interval
> mechanism is applied that will only call workqueue_set_max_active after
> a certain count of work items has been queued, hoping to balance both
> random and sequential performance on HDD.
>
> Signed-off-by: Qu Wenruo 
> Tested-by: David Sterba 
> ---
> Changelog:
> v2->v3:
>   - Add thresholding mechanism to simulate the old thresholding mechanism.
>   - Will not enable thresholding when thresh is set to small value.
> v3->v4:
>   None
> v4->v5:
>   None
> ---
>  fs/btrfs/async-thread.c | 107 
> 
>  fs/btrfs/async-thread.h |   3 +-
>  2 files changed, 101 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 193c849..977bce2 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -30,6 +30,9 @@
>  #define WORK_ORDER_DONE_BIT 2
>  #define WORK_HIGH_PRIO_BIT 3
>
> +#define NO_THRESHOLD (-1)
> +#define DFT_THRESHOLD (32)
> +
>  /*
>   * container for the kthread task pointer and the list of pending work
>   * One of these is allocated per thread.
> @@ -737,6 +740,14 @@ struct __btrfs_workqueue_struct {
>
> /* Spinlock for ordered_list */
> spinlock_t list_lock;
> +
> +   /* Thresholding related variants */
> +   atomic_t pending;
> +   int max_active;
> +   int current_max;
> +   int thresh;
> +   unsigned int count;
> +   spinlock_t thres_lock;
>  };
>
>  struct btrfs_workqueue_struct {
> @@ -745,19 +756,34 @@ struct btrfs_workqueue_struct {
>  };
>
>  static inline struct __btrfs_workqueue_struct
> -*__btrfs_alloc_workqueue(char *name, int flags, int max_active)
> +*__btrfs_alloc_workqueue(char *name, int flags, int max_active, int thresh)
>  {
> struct __btrfs_workqueue_struct *ret = kzalloc(sizeof(*ret), 
> GFP_NOFS);
>
> if (unlikely(!ret))
> return NULL;
>
> +   ret->max_active = max_active;
> +   atomic_set(&ret->pending, 0);
> +   if (thresh == 0)
> +   thresh = DFT_THRESHOLD;
> +   /* For low threshold, disabling threshold is a better choice */
> +   if (thresh < DFT_THRESHOLD) {
> +   ret->current_max = max_active;
> +   ret->thresh = NO_THRESHOLD;
> +   } else {
> +   ret->current_max = 1;
> +   ret->thresh = thresh;
> +   }
> +
> if (flags & WQ_HIGHPRI)
> ret->normal_wq = alloc_workqueue("%s-%s-high", flags,
> -max_active, "btrfs", name);
> +ret->max_active,
> +"btrfs", name);
> else
> ret->normal_wq = alloc_workqueue("%s-%s", flags,
> -max_active, "btrfs", name);
> +ret->max_active, "btrfs",
> +name);
Shouldn't we use ret->current_max instead of ret->max_active (in both calls)?
According to the rest of the code, "max_active" is the absolute
maximum beyond which the "normal_wq" cannot go (you use clamp_value to
ensure that). And "current_max" is the current value of "max_active"
of the "normal_wq". But here, you set the "normal_wq" to "max_active"
immediately. Is this intentional?
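For concreteness, the change I have in mind is just (a sketch only; the
WQ_HIGHPRI branch would change the same way):

	ret->normal_wq = alloc_workqueue("%s-%s", flags,
					 ret->current_max, "btrfs",
					 name);

so that the kernel workqueue starts at the throttled value, and the
thresholding code can later raise it up to "max_active".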


> if (unlikely(!ret->normal_wq)) {
> kfree(ret);
> return NULL;
> @@ -765,6 +791,7 @@ static inline struct __btrfs_workqueue_struct
>
> INIT_LIST_HEAD(&ret->ordered_list);
> spin_lock_init(&ret->list_lock);
> +   spin_lock_init(&ret->thres_lock);
> return ret;
>  }
>
> @@ -773,7 +800,8 @@ __btrfs_destroy_workqueue(struct __btrfs_workqueue_struct 
> *wq);
>
>  struct btrfs_workqueue_struct *btrfs_alloc_workqueue(char *name,
>  int flags,
> -int max_active)
> +int max_active,
> +int thresh)
>  {
> struct btrfs_workqueue_struct *ret = kzalloc(

Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-21 Thread Alex Lyakas
entially (the commit thread).
- before we get to the critical section of the commit, we can have
other threads also running delayed refs, so the commit thread needs to
compete on tree-block
locks with them (and they hold the locks because they also read tree
blocks from disk as it seems)

So my question is: shouldn't we be much more aggressive in
__btrfs_end_transaction, running delayed refs several times and
checking trans->delayed_ref_updates after each run, and returning only
when this number is zero or small enough?
This way, when we trigger a commit, it will not have a lot of delayed
refs to run; it will get very quickly to the critical section, pass it
hopefully very quickly (reaching TRANS_STATE_UNBLOCKED), and then we can
open a new transaction while the previous one is doing
btrfs_write_and_wait_transaction.
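Roughly something like this (an untested sketch of the idea; the
threshold of 64 is arbitrary):

	/* in __btrfs_end_transaction(), before returning */
	while (trans->delayed_ref_updates > 64) {
		unsigned long cur = trans->delayed_ref_updates;

		trans->delayed_ref_updates = 0;
		btrfs_run_delayed_refs(trans, root, cur);
	}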
That's what I wanted to ask.

Thanks!
Alex.


[1] In my case, btrfs metadata is ~10GB and the machine has 8GB of
RAM. Due to this we need to read a lot of ebs from disk, as they are
not in the page cache. Also need to keep in mind that every COW of eb
requires a new slot in the page cache, because we index by "bytenr"
that we receive from the free-space cache, which is a "logical"
coordinate by which EXTENT_ITEMs are sorted in the extent tree.
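(For reference, the indexing is roughly this, as in alloc_extent_buffer()
of this era, so a block COW'ed to a new bytenr necessarily claims fresh
page-cache slots:

	unsigned long index = start >> PAGE_CACHE_SHIFT; /* start == bytenr */

	for (i = 0; i < num_pages; i++, index++)
		p = find_or_create_page(mapping, index, GFP_NOFS);

where "mapping" is btree_inode->i_mapping.)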



On Mon, Jul 13, 2015 at 7:02 PM, Chris Mason  wrote:
> On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
>> Filipe,
>> Thanks for the explanation. Those reasons were not so obvious for me.
>>
>> Would it make sense not to COW the block in case-1, if we are mounted
>> with "notreelog"? Or, perhaps, to check that the block does not belong
>> to a log tree?
>>
>
> Hi Alex,
>
> The crc rules are the most important, we have to make sure the block
> isn't changed while it is in flight.  Also, think about something like
> this:
>
> transaction write block A, puts pointer to it in the btree, generation Y
>
> <block A is written out to disk (e.g. due to an fsync or memory pressure)>
>
> transaction rewrites block A, same generation Y
>
> <crash before the new version of block A is written>
>
> Later on, we try to read block A again.  We find it has the correct crc
> and the correct generation number, but the contents are actually wrong.
>
>> The second case is more difficult. One problem is that
>> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
>> due to memory pressure (this is what I see happening), we complete the
>> writeback, release the extent buffer, and pages are evicted from the
>> page cache of btree_inode. After some time we read the block again
>> (because we want to modify it in the same transaction), but its header
>> is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
>> this point it should be safe to avoid COW, we will re-COW.
>>
>> Would it make sense to have some runtime-only mechanism to lock-out
>> the write-back for an eb? I.e., if we know that eb is not under
>> writeback, and writeback is locked out from starting, we can redirty
>> the block without COW. Then we allow the writeback to start when it
>> wants to.
>>
>> In one of my test runs, btrfs had 6.4GB of metadata (before
>> raid-induced overhead), but during a particular transaction total of
>> 10GB of metadata (again, before raid-induced overhead) was written to
>> disk. (Thisis  total of all ebs having
>> header->generation==curr_transid, not only during commit of the
>> transaction). This particular run was with "notreelog".
>>
>> Machine had 8GB of RAM. Linux allows the btree_inode to grow its
>> page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
>> But even though the used amount of metadata is less than that, this
>> re-COW'ing of already-COW'ed blocks seems to cause page-cache
>> thrashing...
>
> Interesting.  We've addressed this in the past with changes to the
> writepage(s) callback for the btree, basically skipping memory pressure
> related writeback if there isn't that much dirty.  There is a lot of
> room to improve those decisions, like preferring to write leaves over
> nodes, especially full leaves that are not likely to change again.
>
> -chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-13 Thread Alex Lyakas
Filipe,
Thanks for the explanation. Those reasons were not so obvious for me.

Would it make sense not to COW the block in case-1, if we are mounted
with "notreelog"? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and pages are evicted from the
page cache of btree_inode. After some time we read the block again
(because we want to modify it in the same transaction), but its header
is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock-out
the write-back for an eb? I.e., if we know that eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW. Then we allow the writeback to start when it
wants to.
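Something along these lines is what I imagine (purely a sketch; the
"wb_lockout" field is invented for illustration and does not exist):

	/* returns true if writeback is not running and is now locked
	 * out, so the eb can be redirtied in place without COW */
	static bool eb_lockout_writeback(struct extent_buffer *eb)
	{
		atomic_inc(&eb->wb_lockout);	/* invented field */
		if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
			/* writeback already started, must COW */
			atomic_dec(&eb->wb_lockout);
			return false;
		}
		return true;
	}

with the btree writepages path skipping any eb whose wb_lockout is
elevated.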

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (Thisis  total of all ebs having
header->generation==curr_transid, not only during commit of the
transaction). This particular run was with "notreelog".

Machine had 8GB of RAM. Linux allows the btree_inode to grow its
page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
But even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.


On Mon, Jul 13, 2015 at 11:27 AM, Filipe David Manana
 wrote:
> On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas  wrote:
>> Greetings,
>> Looking at the code of should_cow_block(), I see:
>>
>> if (btrfs_header_generation(buf) == trans->transid &&
>>!btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>> ...
>> So if the extent buffer has been written to disk, and now is changed again
>> in the same transaction, we insist on COW'ing it. Can anybody explain why
>> COW is needed in this case? The transaction has not committed yet, so what
>> is the danger of rewriting to the same location on disk? My understanding
>> was that a tree block needs to be COW'ed at most once in the same
>> transaction. But I see that this is not the case.
>
> That logic is there, as far as I can see, for at least 2 obvious reasons:
>
> 1) fsync/log trees. All extent buffers (tree blocks) of a log tree
> have the same transaction id/generation, and you can have multiple
> fsyncs (log transaction commits) per transaction so you need to ensure
> consistency. If we skipped the COWing in the example below, you would
> get an inconsistent log tree at log replay time when the fs is
> mounted:
>
> transaction N start
>
>    fsync inode A start
>    creates tree block X
>    flush X to disk
>    write a new superblock
>    fsync inode A end
>
>    fsync inode B start
>    skip COW of X because its generation == current transaction id and
>    modify it in place
>    flush X to disk
>
> === crash ===
>
>    write a new superblock
>    fsync inode B end
>
> transaction N commit
>
> 2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
> written to disk but instead when we trigger writeback for it. So while
> the writeback is ongoing we want to make sure the block's content
> isn't concurrently modified (we don't keep the eb write locked to
> allow concurrent reads during the writeback).
>
> All tree blocks that don't belong to a log tree are normally written
> only when at the end of a transaction commit. But often, due to memory
> pressure for e.g., the VM can call the writepages() callback of the
> btree inode to force dirty tree blocks to be written to disk before
> the transaction commit.
>
>>
>> I am asking because I am doing some profiling of btrfs metadata work under
>> heavy loads, and I see that sometimes btrfs COW's almost twice more tree
>> blocks than the total metadata size.
>>
>> Thanks,
>> Alex.
>>
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."


question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-12 Thread Alex Lyakas

Greetings,
Looking at the code of should_cow_block(), I see:

if (btrfs_header_generation(buf) == trans->transid &&
   !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
...
So if the extent buffer has been written to disk, and now is changed again 
in the same transaction, we insist on COW'ing it. Can anybody explain why 
COW is needed in this case? The transaction has not committed yet, so what 
is the danger of rewriting to the same location on disk? My understanding 
was that a tree block needs to be COW'ed at most once in the same 
transaction. But I see that this is not the case.


I am asking because I am doing some profiling of btrfs metadata work under 
heavy loads, and I see that sometimes btrfs COW's almost twice more tree 
blocks than the total metadata size.


Thanks,
Alex.



Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible

2014-12-24 Thread Alex Lyakas
Hi Qu,

On Wed, Dec 24, 2014 at 3:09 AM, Qu Wenruo  wrote:
>
> -------- Original Message --------
> Subject: Re: [PATCH] btrfs-progs: rebuild missing block group during chunk
> recovery if possible
> From: Alex Lyakas 
> To: Qu Wenruo 
> Date: 2014-12-24 00:49
>>
>> Hi Qu,
>>
>> On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo 
>> wrote:
>>>
>>> [snipped]
>>> +
>>> +static int __insert_block_group(struct btrfs_trans_handle *trans,
>>> +   struct chunk_record *chunk_rec,
>>> +   struct btrfs_root *extent_root,
>>> +   u64 used)
>>> +{
>>> +   struct btrfs_block_group_item bg_item;
>>> +   struct btrfs_key key;
>>> +   int ret = 0;
>>> +
>>> +   btrfs_set_block_group_used(&bg_item, used);
>>> +   btrfs_set_block_group_chunk_objectid(&bg_item, used);
>>
>> This looks like a bug. Instead of "used", I think it should be
>> "BTRFS_FIRST_CHUNK_TREE_OBJECTID".
>
> Oh, my mistake, BTRFS_FIRST_CHUNK_TREE_OBJECTID is right.
> Thanks for pointing out this.
>>
>>
>>> [snipped]
>>> --
>>> 2.1.2
>>
>> Couple of questions:
>> # In remove_chunk_extent_item, should we also consider "rebuild"
>> chunks now? It can happen that a "rebuild" chunks is a SYSTEM chunk.
>> Should we try to handle it as well?
>
> Not quite sure about the meaning of "rebuild" here.
> The chunk-recovery has the rebuild_chunk_tree() function to rebuild the
> whole chunk tree with
> the good/repaired chunks we found.
>>
>> # Same question for "rebuild_sys_array". Should we also consider
>> "rebuild" chunks?
>
> The chunk-recovery has rebuild_sys_array() to handle SYSTEM chunk too.
>
I meant that with this patch you have added "rebuild_chunks" list:
struct list_head good_chunks;
struct list_head bad_chunks;
struct list_head rebuild_chunks; <--- you added this
struct list_head unrepaired_chunks;


These are chunks that have no block-group record, but we are confident
that we can rebuild the block-group records for these chunks by
scanning all EXTENT_ITEMs in the block-group range and calculating the
"used" value for the block-group. If we fail, we just set
used==block-group size. My question is: should we now consider those
"rebuild_chunks" same as "good_chunks"? I.e., should we also consider
those chunks in the following functions:
- remove_chunk_extent_item: probably no, because we need the
EXTENT_ITEMs to recalculate the "used" value
- rebuild_sys_array: if it happens that a "rebuild_chunk" is also a
SYSTEM chunk, should we add it to the sys_chunk_array too? (In
addition to good_chunks).

Thanks,
Alex.


> Thanks,
> Qu
>
>>
>> Thanks,
>> Alex.
>>
>>
>>
>
>


Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible

2014-12-23 Thread Alex Lyakas
Hi Qu,

On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo  wrote:
> Before the patch, chunk will be considered bad if the corresponding
> block group is missing, even the only uncertain data is the 'used'
> member of the block group.
>
> This patch will try to recalculate the 'used' value of the block group
> and rebuild it.
> So even only chunk item and dev extent item is found, the chunk can be
> recovered.
> Although if extent tree is damanged and needed extent item can't be
> read, the block group's 'used' value will be the block group length, to
> prevent any later write/block reserve damaging the block group.
> In that case, we will prompt user and recommend them to use
> '--init-extent-tree' to rebuild extent tree if possible.
>
> Signed-off-by: Qu Wenruo 
> ---
>  btrfsck.h   |   3 +-
>  chunk-recover.c | 242 +---
>  cmds-check.c|  29 ---
>  3 files changed, 234 insertions(+), 40 deletions(-)
>
> diff --git a/btrfsck.h b/btrfsck.h
> index 356c767..7a50648 100644
> --- a/btrfsck.h
> +++ b/btrfsck.h
> @@ -179,5 +179,6 @@ btrfs_new_device_extent_record(struct extent_buffer *leaf,
>  int check_chunks(struct cache_tree *chunk_cache,
>  struct block_group_tree *block_group_cache,
>  struct device_extent_tree *dev_extent_cache,
> -struct list_head *good, struct list_head *bad, int silent);
> +struct list_head *good, struct list_head *bad,
> +struct list_head *rebuild, int silent);
>  #endif
> diff --git a/chunk-recover.c b/chunk-recover.c
> index 6f43066..dbf98b5 100644
> --- a/chunk-recover.c
> +++ b/chunk-recover.c
> @@ -61,6 +61,7 @@ struct recover_control {
>
> struct list_head good_chunks;
> struct list_head bad_chunks;
> +   struct list_head rebuild_chunks;
> struct list_head unrepaired_chunks;
> pthread_mutex_t rc_lock;
>  };
> @@ -203,6 +204,7 @@ static void init_recover_control(struct recover_control 
> *rc, int verbose,
>
> INIT_LIST_HEAD(&rc->good_chunks);
> INIT_LIST_HEAD(&rc->bad_chunks);
> +   INIT_LIST_HEAD(&rc->rebuild_chunks);
> INIT_LIST_HEAD(&rc->unrepaired_chunks);
>
> rc->verbose = verbose;
> @@ -529,22 +531,32 @@ static void print_check_result(struct recover_control 
> *rc)
> return;
>
> printf("CHECK RESULT:\n");
> -   printf("Healthy Chunks:\n");
> +   printf("Recoverable Chunks:\n");
> list_for_each_entry(chunk, &rc->good_chunks, list) {
> print_chunk_info(chunk, "  ");
> good++;
> total++;
> }
> -   printf("Bad Chunks:\n");
> +   list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
> +   print_chunk_info(chunk, "  ");
> +   good++;
> +   total++;
> +   }
> +   list_for_each_entry(chunk, &rc->unrepaired_chunks, list) {
> +   print_chunk_info(chunk, "  ");
> +   good++;
> +   total++;
> +   }
> +   printf("Unrecoverable Chunks:\n");
> list_for_each_entry(chunk, &rc->bad_chunks, list) {
> print_chunk_info(chunk, "  ");
> bad++;
> total++;
> }
> printf("\n");
> -   printf("Total Chunks:\t%d\n", total);
> -   printf("  Heathy:\t%d\n", good);
> -   printf("  Bad:\t%d\n", bad);
> +   printf("Total Chunks:\t\t%d\n", total);
> +   printf("  Recoverable:\t\t%d\n", good);
> +   printf("  Unrecoverable:\t%d\n", bad);
>
> printf("\n");
> printf("Orphan Block Groups:\n");
> @@ -555,6 +567,7 @@ static void print_check_result(struct recover_control *rc)
> printf("Orphan Device Extents:\n");
> list_for_each_entry(devext, &rc->devext.no_chunk_orphans, chunk_list)
> print_device_extent_info(devext, "  ");
> +   printf("\n");
>  }
>
>  static int check_chunk_by_metadata(struct recover_control *rc,
> @@ -938,6 +951,11 @@ static int build_device_maps_by_chunk_records(struct 
> recover_control *rc,
> if (ret)
> return ret;
> }
> +   list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
> +   ret = build_device_map_by_chunk_record(root, chunk);
> +   if (ret)
> +   return ret;
> +   }
> return ret;
>  }
>
> @@ -1168,12 +1186,31 @@ static int __rebuild_device_items(struct 
> btrfs_trans_handle *trans,
> return ret;
>  }
>
> +static int __insert_chunk_item(struct btrfs_trans_handle *trans,
> +   struct chunk_record *chunk_rec,
> +   struct btrfs_root *chunk_root)
> +{
> +   struct btrfs_key key;
> +   struct btrfs_chunk *chunk = NULL;
> +   int ret = 0;
> +
> +   chunk = create_chunk_item(chunk_rec);
> +   if (!chunk)
> +   return -ENOMEM;
> +   key.objec

Re: How btrfs-find-root knows that the block is actually a root?

2014-12-23 Thread Alex Lyakas
Hi Qu,

On Tue, Dec 23, 2014 at 7:27 AM, Qu Wenruo  wrote:
>
> -------- Original Message --------
> Subject: How btrfs-find-root knows that the block is actually a root?
> From: Alex Lyakas 
> To: linux-btrfs 
> Date: 2014-12-22 22:57
>>
>> Greetings,
>>
>> I am looking at the code of search_iobuf() in
>> btrfs-find-root.c (3.17.3). I see that we probe nodesize blocks one by
>> one, and for each block we check:
>> - its owner is what we are looking for
>> - its header->bytenr is what we are looking at currently
>> - its level is not too small
>> - it has valid checksum
>> - it has the desired generation
>>
>> If all those conditions are true, we declare this block as a root and
>> end the program.
>>
>> How do we actually know that it's a root and not a leaf or an
>> intermediate node? What if we are searching for a root of the root
>> tree, which has one node and two leafs (all have the same highest
>> transid), and one of the leafs has "logical" lower than the actual
>> root, i.e., it comes first in our scan. Then we will declare this leaf
>> as a root, won't we? Or somehow the root always has the lowest
>> "logical"?
>
> You can refer to this patch:
> https://patchwork.kernel.org/patch/5285521/
I see that this has not been applied to any of David's branches. Do
you have a repo to look at this code in its entirety?

>
> Your questions are mostly right.
> The best method should be search through all the metadata, and only the
> highest level header for
> a given generation may be the root for that generation.
>
> But that method still has some problems.
> 1) Overwritten old node/leaf
> As btrfs metadata cow happens, old nodes/leaves may be overwritten and become
> incomplete,
> so above method won't always work as expected.
>
> 2) Corrupted fs
> That will makes everything not work as expected.
> But sadly, when someone needs to use btrfs-find-root, there is a high
> possibility the fs is already corrupted.
>
> 3) Slow speed
> It needs to scan over all the sectors of metadata chunks, it may var from
> megabytese to tegabytes,
> which makes the complete scan impractical.
> So the current find-root uses a trade-off: if we find a header at the position
> the superblock points to, and its generation
> matches, then just consider it the desired root and exit.
I think this is a bit optimistic. What if the root tree has several
leaves having the same generation as the root? Then we might declare a
leaf as a root and exit. But further recovery based on that output
will get us into trouble.

>
>>
>> Also, I am confused by this line:
>> level = h_level;
>> This means that if we encounter a block that "seems good", we will
>> skip all other blocks that have lower level. Is this intended?
>
> This is intended, for case user already know the root's level, so it will
> skip any header whose level is below it.
But this line is performed before the generation check. Let's say that
user did not specify any level (so search_level==0). Then assume we
encounter a block, which has lower generation than what we need, but
higher level. At this point, we do
level = h_level;
and we will skip any blocks lower than this level from now on. What if
the root tree got shrunk (due to subvolume deletion, for example),
and the "good" root has a lower level? We will skip it then, and will
not find the root.
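What I would expect is something like this reordering (a sketch against
the 3.17.3 search_iobuf() code, untested):

	if (h_level < level)
		goto next;
	if (csum_block(block, nodesize))
		goto next;
	if (h_gen != gen)
		goto next;	/* don't let a stale block raise 'level' */
	level = h_level;	/* only remember generation-matching candidates */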

Thanks for your comments,
Alex.


>
> Thanks,
> Qu
>>
>>
>> Thanks,
>> Alex.
>
>


How btrfs-find-root knows that the block is actually a root?

2014-12-22 Thread Alex Lyakas
Greetings,

I am looking at the code of search_iobuf() in
btrfs-find-root.c (3.17.3). I see that we probe nodesize blocks one by
one, and for each block we check:
- its owner is what we are looking for
- its header->bytenr is what we are looking at currently
- its level is not too small
- it has valid checksum
- it has the desired generation

If all those conditions are true, we declare this block as a root and
end the program.

How do we actually know that it's a root and not a leaf or an
intermediate node? What if we are searching for a root of the root
tree, which has one node and two leaves (all have the same highest
transid), and one of the leaves has a "logical" lower than the actual
root, i.e., it comes first in our scan. Then we will declare this leaf
as a root, won't we? Or somehow the root always has the lowest
"logical"?

Also, I am confused by this line:
level = h_level;
This means that if we encounter a block that "seems good", we will
skip all other blocks that have lower level. Is this intended?

Thanks,
Alex.


Re: [PATCH] Btrfs: update commit root on snapshot creation after orphan cleanup

2014-08-02 Thread Alex Lyakas
Hi Filipe,
Thank you for the explanation.
I understand that without your patch we return to user-space after
deleting the orphans, but leaving the transaction open. So user-space
sees the snapshot and can start send. With your patch, we return to
user-space after orphan cleanup has been committed. Unless we crash in
the middle, like you pointed out.

I will also look at the new patch.

Thanks!
Alex.




On Thu, Jul 31, 2014 at 3:41 PM, Filipe David Manana  wrote:
> On Mon, Jul 28, 2014 at 6:31 PM, Filipe David Manana  
> wrote:
>> On Sat, Jul 19, 2014 at 8:11 PM, Alex Lyakas
>>  wrote:
>>> Hi Filipe,
>>> It's quite possible I don't fully understand the issue. It seems that
>>> we are creating a read-only snapshot, commit a transaction, and then
>>> go and modify the snapshot once again, by deleting all the
>>> ORPHAN_ITEMs we have in its file tree (btrfs_orphan_cleanup).
>>> Shouldn't all this be part of snapshot creation, so that after we
>>> commit, we have a clean file tree with no orphans there? (not sure if
>>> this makes sense though).
>>>
>>> With your patch we do this additional commit after the cleanup. But
>>> nothing prevents "send" from starting before this additional commit,
>>> correct? And it would still see the orphans through the commit root.
>>> You say that it is not a problem, but I am not sure why (probably I am
>>> missing something here). So for me it looks like your patch closes a
>>> race window significantly (at the cost of an additional commit), but
>>> does not close it fully.
>>
>> Hi Alex,
>>
>> That's right, after the transaction commit finishes, the snapshot will
>> be visible and accessible to user space - so someone may start a send
>> before the orphan cleanup starts. It was ok only for the serialized
>> case (create snapshot, wait for ioctl to return, call send ioctl).
>
> Actually no. If after the 1st transaction commit (the one that creates
> the snapshot and makes it visible to user space) and before the orphan
> cleanup is called another task attempts to use the snapshot for a send
> operation, it will block when doing the snapshot dentry lookup -
> because both tasks acquire the parent inode's mutex (implicitly
> through the vfs and explicitly via the snapshot/subvol ioctl entry
> point).
>
> Nevertheless, it's better to move the commit root switch part to the
> dentry lookup function (as the new patch does), since after the first
> transaction commit and before the second one commits, a reboot might
> happen, and after that we would get into the same issue until the
> first transaction commit happens after the reboot. I'll update the new
> patch's comment.
>
> thanks
>
>>
>>>
>>> But most important: perhaps "send" should look for ORPHAN_ITEMs and
>>> treat those inodes as "deleted"?
>>
>> There are other cases were orphans can exist, like for file truncates
>> for example, where ignoring the inode wouldn't be very correct.
>> Tried that approach initially, but it's actually more complex to
>> implement and adds some additional overhead (tree searches - and the
>> orphan items are normally too far from the inode items, due to a very
>> high objectid (-5ULL)).
>>
>> I've reworked this with a different approach and CC'ed you
>> (https://patchwork.kernel.org/patch/4635471/).
>>
>> thanks
>>
>>>
>>> Thanks,
>>> Alex.
>>>
>>>
>>>
>>> On Tue, Jun 3, 2014 at 2:41 PM, Filipe David Borba Manana
>>>  wrote:
>>>> On snapshot creation (either writable or read-only), we do orphan cleanup
>>>> against the root of the snapshot. If the cleanup did remove any orphans,
>>>> then the current root node will be different from the commit root node
>>>> until the next transaction commit happens.
>>>>
>>>> A send operation always uses the commit root of a snapshot - this means
>>>> it will see the orphans if it starts computing the send stream before the
>>>> next transaction commit happens (triggered by a timer or sync(), e.g.),
>>>> which is when the commit root gets assigned a reference to current root,
>>>> where the orphans are not visible anymore. The consequence of send seeing
>>>> the orphans is explained below.
>>>>
>>>> For example:
>>>>
>>>> mkfs.btrfs -f /dev/sdd
>>>> mount -o commit=999 /dev/sdd /mnt
>>>>
>>>> # open a file with O_

Re: [PATCH] Btrfs: update commit root on snapshot creation after orphan cleanup

2014-07-19 Thread Alex Lyakas
Hi Filipe,
It's quite possible I don't fully understand the issue. It seems that
we are creating a read-only snapshot, commit a transaction, and then
go and modify the snapshot once again, by deleting all the
ORPHAN_ITEMs we have in its file tree (btrfs_orphan_cleanup).
Shouldn't all this be part of snapshot creation, so that after we
commit, we have a clean file tree with no orphans there? (not sure if
this makes sense though).

With your patch we do this additional commit after the cleanup. But
nothing prevents "send" from starting before this additional commit,
correct? And it would still see the orphans through the commit root.
You say that it is not a problem, but I am not sure why (probably I am
missing something here). So for me it looks like your patch closes a
race window significantly (at the cost of an additional commit), but
does not close it fully.

But most important: perhaps "send" should look for ORPHAN_ITEMs and
treat those inodes as "deleted"?
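I imagine something like this (a hypothetical sketch; orphan items live
in the file tree under objectid -5ULL, keyed by inode number):

	key.objectid = BTRFS_ORPHAN_OBJECTID;
	key.type = BTRFS_ORPHAN_ITEM_KEY;
	key.offset = ino;
	ret = btrfs_search_slot(NULL, send_root, &key, path, 0, 0);
	if (ret == 0) {
		/* orphan item exists: treat 'ino' as deleted in the
		 * send stream */
	}

with the lookup done against the commit root, like the other searches
that send performs.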

Thanks,
Alex.



On Tue, Jun 3, 2014 at 2:41 PM, Filipe David Borba Manana
 wrote:
> On snapshot creation (either writable or read-only), we do orphan cleanup
> against the root of the snapshot. If the cleanup did remove any orphans,
> then the current root node will be different from the commit root node
> until the next transaction commit happens.
>
> A send operation always uses the commit root of a snapshot - this means
> it will see the orphans if it starts computing the send stream before the
> next transaction commit happens (triggered by a timer or sync(), e.g.),
> which is when the commit root gets assigned a reference to current root,
> where the orphans are not visible anymore. The consequence of send seeing
> the orphans is explained below.
>
> For example:
>
> mkfs.btrfs -f /dev/sdd
> mount -o commit=999 /dev/sdd /mnt
>
> # open a file with O_TMPFILE and leave it open
> # write some data to the file
> btrfs subvolume snapshot -r /mnt /mnt/snap1
>
> btrfs send /mnt/snap1 -f /tmp/send.data
>
> The send operation will fail with the following error:
>
> ERROR: send ioctl failed with -116: Stale file handle
>
> What happens here is that our snapshot has an orphan inode still visible
> through the commit root, that corresponds to the tmpfile. However send
> will attempt to call inode.c:btrfs_iget(), with the goal of reading the
> file's data, which will return -ESTALE because it will use the current
> root (and not the commit root) of the snapshot.
>
> Of course, there are other cases where we can get orphans, but this
> example using a tmpfile makes it much easier to reproduce the issue.
>
> Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
> the commit root is different from the current root, just commit the
> transaction associated with the snapshot's root (if it exists), so that
> a send will not see any orphans that don't exist anymore. This also
> guarantees a send will always see the same content regardless of whether
> a transaction commit happened already before the send was requested and
> after the orphan cleanup (meaning the commit root and current roots are
> the same) or it hasn't happened yet (commit and current roots are
> different).
>
> Signed-off-by: Filipe David Borba Manana 
> ---
>  fs/btrfs/ioctl.c | 29 +
>  1 file changed, 29 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 95194a9..6680ad9 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -712,6 +712,35 @@ static int create_snapshot(struct btrfs_root *root, 
> struct inode *dir,
> if (ret)
> goto fail;
>
> +   /*
> +* If orphan cleanup did remove any orphans, it means the tree was
> +* modified and therefore the commit root is not the same as the
> +* current root anymore. This is a problem, because send uses the
> +* commit root and therefore can see inode items that don't exist
> +* in the current root anymore, and for example make calls to
> +* btrfs_iget, which will do tree lookups based on the current root
> +* and not on the commit root. Those lookups will fail, returning a
> +* -ESTALE error, and making send fail with that error. So make sure
> +* a send does not see any orphans we have just removed, and that it
> +* will see the same inodes regardless of whether a transaction
> +* commit happened before it started (meaning that the commit root
> +* will be the same as the current root) or not.
> +*/
> +   if (readonly && pending_snapshot->snap->node !=
> +   pending_snapshot->snap->commit_root) {
> +   trans = btrfs_join_transaction(pending_snapshot->snap);
> +   if (IS_ERR(trans) && PTR_ERR(trans) != -ENOENT) {
> +   ret = PTR_ERR(trans);
> +   goto fail;
> +   }
> +   if (!IS_ERR(trans)) {
> + 

Re: safe/necessary to balance system chunks?

2014-06-19 Thread Alex Lyakas
On Fri, Apr 25, 2014 at 10:14 PM, Hugo Mills  wrote:
> On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
>> On 2014-04-25 13:24, Chris Murphy wrote:
>> >
>> > On Apr 25, 2014, at 8:57 AM, Steve Leung  wrote:
>> >
>> >>
>> >> Hi list,
>> >>
>> >> I've got a 3-device RAID1 btrfs filesystem that started out life as 
>> >> single-device.
>> >>
>> >> btrfs fi df:
>> >>
>> >> Data, RAID1: total=1.31TiB, used=1.07TiB
>> >> System, RAID1: total=32.00MiB, used=224.00KiB
>> >> System, DUP: total=32.00MiB, used=32.00KiB
>> >> System, single: total=4.00MiB, used=0.00
>> >> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>> >>
>> >> This still lists some system chunks as DUP, and not as RAID1.  Does this 
>> >> mean that if one device were to fail, some system chunks would be 
>> >> unrecoverable?  How bad would that be?
>> >
>> > Since it's "system" type, it might mean the whole volume is toast if the 
>> > drive containing those 32KB dies. I'm not sure what kind of information is 
>> > in system chunk type, but I'd expect it's important enough that if 
>> > unavailable that mounting the file system may be difficult or impossible. 
>> > Perhaps btrfs restore would still work?
>> >
>> > Anyway, it's probably a high penalty for losing only 32KB of data.  I 
>> > think this could use some testing to try and reproduce conversions where 
>> > some amount of "system" or "metadata" type chunks are stuck in DUP. This 
>> > has come up before on the list but I'm not sure how it's happening, as 
>> > I've never encountered it.
>> >
>> As far as I understand it, the system chunks are THE root chunk tree for
>> the entire system, that is to say, it's the tree of tree roots that is
>> pointed to by the superblock. (I would love to know if this
>> understanding is wrong).  Thus losing that data almost always means
>> losing the whole filesystem.
>
> From a conversation I had with cmason a while ago, the System
> chunks contain the chunk tree. They're special because *everything* in
> the filesystem -- including the locations of all the trees, including
> the chunk tree and the roots tree -- is positioned in terms of the
> internal virtual address space. Therefore, when starting up the FS,
> you can read the superblock (which is at a known position on each
> device), which tells you the virtual address of the other trees... and
> you still need to find out where that really is.
>
>The superblock has (I think) a list of physical block addresses at
> the end of it (sys_chunk_array), which allows you to find the blocks
> for the chunk tree and work out this mapping, which allows you to find
> everything else. I'm not 100% certain of the actual format of that
> array -- it's declared as u8 [2048], so I'm guessing there's a load of
> casting to something useful going on in the code somewhere.
The format is just a list of pairs:
struct btrfs_disk_key,  struct btrfs_chunk
struct btrfs_disk_key,  struct btrfs_chunk
...

For each SYSTEM block-group (btrfs_chunk), we need one entry in the
sys_chunk_array. During mkfs the first SYSTEM block group is created,
for me it's 4MB. So only if the whole chunk tree grows over 4MB do we
need to create an additional SYSTEM block group, and then we need to
have a second entry in the sys_chunk_array. And so on.
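The casting mentioned above is essentially this (a sketch, where "sb"
stands for the struct btrfs_super_block; the kernel's actual reader is
btrfs_read_sys_array() in volumes.c):

	u8 *ptr = sb->sys_chunk_array;
	u8 *end = ptr + btrfs_super_sys_array_size(sb);

	while (ptr < end) {
		struct btrfs_disk_key *key = (struct btrfs_disk_key *)ptr;
		struct btrfs_chunk *chunk = (struct btrfs_chunk *)(key + 1);

		/* key->objectid == BTRFS_FIRST_CHUNK_TREE_OBJECTID,
		 * key->offset == logical start of this SYSTEM chunk */
		ptr += sizeof(*key) +
		       btrfs_chunk_item_size(btrfs_stack_chunk_num_stripes(chunk));
	}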

Alex.


>
>Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
>   PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
> --- Is it still called an affair if I'm sleeping with my wife ---
> behind her lover's back?


Re: Snapshot aware defrag and qgroups thoughts

2014-06-19 Thread Alex Lyakas
Hi Josef,
thanks for the detailed description of how the extent tree works!
When I was digging through that in the past, I made some slides to
remember all the call chains. Maybe somebody finds that useful to
accompany your notes.
https://drive.google.com/file/d/0ByBy89zr3kJNNmM5OG5wXzQ3LUE/edit?usp=sharing

Thanks,
Alex.


On Mon, Apr 21, 2014 at 5:55 PM, Josef Bacik  wrote:
> We have a big problem, but it involves a lot of moving parts, so I'm going
> to
> explain all of the parts, and then the problem, and then what I am doing to
> fix
> the problem.  I want you guys to check my work to make sure I'm not missing
> something so when I come back from paternity leave in a few weeks I can just
> sit
> down and finish this work out.
>
> === Extent refs ===
>
> This is basically how extent refs work.  You have either
>
> key.objectid = bytenr;
> key.type = BTRFS_EXTENT_ITEM_KEY;
> key.offset = length;
>
> or you have
>
> key.objectid = bytenr;
> key.type = BTRFS_METADATA_ITEM_KEY;
> key.offset = level of the metadata block;
>
> in the case of skinny metadata.  Then you have the extent item which
> describes
> the number of refs and such, followed by 1 or more inline refs.  All you
> need
> to know for this problem is how I'm going to describe them.  What I call a
> "normal ref" or a "full ref" is a reference that has the actual root
> information
> in the ref.  What I call a "shared ref" is one where we only know the tree
> block
> that owns the particular ref.  So how does this work in practice?
>
> 1) Normal allocation - metadata:  We allocate a tree block as we add new
> items
> to a tree.  We know that this root owns this tree block so we create a
> normal
> ref with the root objectid in the extent ref.  We also set the owner of the
> block itself to our objectid.  This is important to keep in mind.
>
> 2) Normal allocation - data: We allocate some data for a given fs tree and
> we
> add a extent ref with the root objectid of the tree we are in, the inode
> number
> and the logical offset into the inode for this inode.
>
> 3) Splitting a data extent: We write to the middle of an existing extent.
> We
> will split this extent into two BTRFS_EXTENT_DATA_KEY items and then increase
> the
> ref count of the original extent by 1.  This means we look up the extent ref
> for
> root->objectid, inode number and the _original_ inode offset.  We don't
> create
> another extent ref, this is important to keep in mind.
>
> = btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref =
>
> But Josef, didn't you say there were shared refs?  Why yes I did, but I need
> to
> explain it in context of the people who actually do the dirty work. We'll
> start
> with the easy case
>
> 1) btrfs_copy_root - where snapshots start:  When we make a snapshot we call
> this function, which allocates a completely new block with a new root
> objectid
> and then memcpy's the original root we are snapshotting.  Then we call
> btrfs_inc_ref on our new buffer, which will walk all items in that buffer
> and
> add a new normal ref to each of those blocks for our new root.  This is only
> at
> the level below the new root, nothing below that point.
>
> 2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots: These guys are
> responsible for dealing with the particular action we want to make on our
> given
> buffer.  So if we are free'ing our buffer, we need to drop any refs it has
> to
> the blocks it points to.  For level > 0 this means modifying refs for all of
> the
> tree blocks it points to.  For level == 0 this means modifying refs for any
> data
> extents the leaf may point to.
>
> 3) update_ref_for_cow - this is where the magic happens:  This has a few
> different modes of operation, but every operation means we check to see if
> the
> block is shared, which is we see if we have been snapshotted and if we have
> been
> see if this block has changed since we snapshotted.  If it is shared then we
> look up the extent refs and the flags.  If not then we carry on. From here
> we
> have a few options.
>
> 3a) Not shared: Don't do anything, we can do our normal cow operations and
> carry
> on.
>
> 3b) Shared and cowing from the owning root: This is where the
> btrfs_header_owner() is important.  If we owned this block and it is shared
> then
> we know that any of the upper levels won't have a normal ref to anything
> underneath this block, so we need to add a shared ref for anything this
> block
> points to.  So the first thing we do is btrfs_inc_ref(), but we set the full
> backref flag.  This means that when we add refs for everything this block
> points
> to we don't use a root objectid, we use the bytenr of this block. Then we
> set
> BTRFS_BLOCK_FLAG_FULL_BACKREF for the extent flags for this give block.
>
> 3c) Shared and cowing from not the owning root: So if we are cowing down
> from
> the snapshot we need to make sure that any block we own completely ourselves
> has
> normal refs for any blocks it points to.  So

Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv

2014-06-18 Thread Alex Lyakas
Hi Filipe,
I finally got to debug this deeper. As it turns out, this happens only
if both "nospace_cache" and "clear_cache" are specified. You need to
unmount and mount again to cause this. After mounting, due to
"clear_cache", all the block-groups are marked as BTRFS_DC_CLEAR, and
then cache_save_setup() is called on them (this function is called
only in case of BTRFS_DC_CLEAR). So cache_save_setup() goes ahead and
creates the free-space inode. But then it realizes that it was mounted
with nospace_cache, so it does not put any content in the inode. But
the inode itself gets created. The patch that fixes this for me:


alex@ubuntu-alex:/mnt/work/alex/linux-stable/source$ git diff -U10
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d170412..06f876e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2941,20 +2941,26 @@ again:
 		goto out;
 	}
 
 	if (IS_ERR(inode)) {
 		BUG_ON(retries);
 		retries++;
 
 		if (block_group->ro)
 			goto out_free;
 
+		/* with nospace_cache avoid creating the free-space inode */
+		if (!btrfs_test_opt(root, SPACE_CACHE)) {
+			dcs = BTRFS_DC_WRITTEN;
+			goto out_free;
+		}
+
 		ret = create_free_space_inode(root, trans, block_group, path);
 		if (ret)
 			goto out_free;
 		goto again;
 	}
 
 	/* We've already setup this transaction, go ahead and exit */
 	if (block_group->cache_generation == trans->transid &&
 	    i_size_read(inode)) {
 		dcs = BTRFS_DC_SETUP;



Thanks,
Alex.


On Wed, Nov 6, 2013 at 3:19 PM, Filipe David Manana  wrote:
> On Mon, Nov 4, 2013 at 12:16 PM, Alex Lyakas
>  wrote:
>> Hi Filipe,
>> any luck with this patch?:)
>
> Hey Alex,
>
> I haven't digged further, but I remember I couldn't reproduce your
> issue (with latest btrfs-next of that day) of getting the free space
> inodes created even when mount option nospace_cache is given.
>
> What kernel were you using?
>
>>
>> Alex.
>>
>> On Wed, Oct 23, 2013 at 5:26 PM, Filipe David Manana  
>> wrote:
>>> On Wed, Oct 23, 2013 at 3:14 PM, Alex Lyakas
>>>  wrote:
>>>> Hello,
>>>>
>>>> On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana  
>>>> wrote:
>>>>> On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas
>>>>>  wrote:
>>>>>> Hi Filipe,
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana
>>>>>>  wrote:
>>>>>>>
>>>>>>> This issue is simple to reproduce and observe if kmemleak is enabled.
>>>>>>> Two simple ways to reproduce it:
>>>>>>>
>>>>>>> ** 1
>>>>>>>
>>>>>>> $ mkfs.btrfs -f /dev/loop0
>>>>>>> $ mount /dev/loop0 /mnt/btrfs
>>>>>>> $ btrfs balance start /mnt/btrfs
>>>>>>> $ umount /mnt/btrfs
>>>>
>>>> So here it seems that the leak can only happen in case the block-group
>>>> has a free-space inode. This is what the orphan item is added for.
>>>> Yes, that is where kmemleak reports the leak.
>>>> But: if the space_cache option is disabled (and nospace_cache enabled), it
>>>> seems that btrfs still creates the FREE_SPACE inodes, although they
>>>> are empty because in cache_save_setup:
>>>>
>>>> inode = lookup_free_space_inode(root, block_group, path);
>>>> if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
>>>> ret = PTR_ERR(inode);
>>>> btrfs_release_path(path);
>>>> goto out;
>>>> }
>>>>
>>>> if (IS_ERR(inode)) {
>>>> ...
>>>> ret = create_free_space_inode(root, trans, block_group, path);
>>>>
>>>> and only later it actually sets BTRFS_DC_WRITTEN if space_cache option
>>>> is disabled. Amazing!
>>>> Although this is a different issue, do you know perhaps why these
>>>> empty inodes are needed?
>>>
>>> Don't know if they are needed. But you have a point, it seems odd to
>>> create the free space cache inode if mount option nospace_cache was
>>> supplied. Thanks Alex. Testing the following patch:
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index c43ee8a..eb1b7da 100

Re: snapshot send with parent question

2014-05-31 Thread Alex Lyakas
Michael,
btrfs-send doesn't really know or care how you managed to get from
a to c. It is able to compare any two RO subvolumes (not necessarily
related by snapshot operations), and produce a stream of commands that
transfer a-content into c-content.

Send assumes that at a receive side, you have a snapshot identical to
"a". Then receive side locates the a-snapshot (by a "received_UUID"
field) and creates a RW snapshot out of it. This snapshot would be
identical to "c", after applying the stream of commands. Then receive
side applies the stream of commands (in strict order), and at the end
sets the RW snapshot to be RO. At this point, this snapshot should be
identical to c.

The stream of commands most probably will not be identical to
operations that you did in order to get from "a" into "c". But it will
transfer "a"-content into "c"-content (leave alone possible bugs),
which is what's important.

Of course, if a and c are related via snapshot operations, then
btrfs-send will be much more efficient, in terms that it will be able
to skip entire btrfs subtrees (look at "btrfs_compare_trees"), thus
avoiding many additional comparisons that some other tool like rsync
would have done.
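The check that makes this possible looks roughly like the following
(abridged from btrfs_compare_trees() in ctree.c of this era):

	left_blockptr = btrfs_node_blockptr(left_path->nodes[left_level],
					    left_path->slots[left_level]);
	right_blockptr = btrfs_node_blockptr(right_path->nodes[right_level],
					     right_path->slots[right_level]);
	if (left_blockptr == right_blockptr) {
		/* shared block, don't allow to go deeper */
		advance_left = ADVANCE_ONLY_NEXT;
		advance_right = ADVANCE_ONLY_NEXT;
	}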

Thanks,
Alex.



On Sun, Apr 20, 2014 at 1:00 AM, Michael Welsh Duggan  wrote:
> Assume the following scenario:
> There exists a read-only snapshot called a.
> A read-write snapshot called b is created from a, and is then modified.
> A read-only snapshot of b is created, called c.
> A btrfs send is done for c, with a marked as its parent.
>
> Will the send data only contain the differences between a and c?  My
> experiments seem to indicate no, but I have no confidence that I am not
> doing something incorrectly.
>
> Also, when a btrfs receive gets a stream containing the differences
> between a (parent) and c, does it only look at the relative pathname
> differences between a and c in order to determine the matching parent on
> the receiving side?
>
> --
> Michael Welsh Duggan
> (m...@md5i.com)
>


Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-05-03 Thread Alex Lyakas
Hi Josef,
this problem could not happen when find_free_extent() was receiving a
transaction handle (which was changed in "Btrfs: avoid starting a
transaction in the write path"), correct? Because it would have used
the passed transaction handle to do the chunk allocation, and thus
would not need to do join_transaction/end_transaction leading to
recursive run_delayed_refs call.

Alex.


On Fri, Mar 7, 2014 at 3:01 AM, Josef Bacik  wrote:
> Zach found this deadlock that would happen like this
>
> btrfs_end_transaction <- reduce trans->use_count to 0
>   btrfs_run_delayed_refs
> btrfs_cow_block
>   find_free_extent
> btrfs_start_transaction <- increase trans->use_count to 1
>   allocate chunk
> btrfs_end_transaction <- decrease trans->use_count to 0
>   btrfs_run_delayed_refs
> lock tree block we are cowing above ^^
>
> We need to only decrease trans->use_count if it is above 1, otherwise leave it
> alone.  This will make nested trans be the only ones who decrease their added
> ref, and will let us get rid of the trans->use_count++ hack if we have to 
> commit
> the transaction.  Thanks,
>
> cc: sta...@vger.kernel.org
> Reported-by: Zach Brown 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 34cd831..b05bf58 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -683,7 +683,8 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
> int lock = (trans->type != TRANS_JOIN_NOLOCK);
> int err = 0;
>
> -   if (--trans->use_count) {
> +   if (trans->use_count > 1) {
> +   trans->use_count--;
> trans->block_rsv = trans->orig_rsv;
> return 0;
> }
> @@ -731,17 +732,10 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
> }
>
> if (lock && ACCESS_ONCE(cur_trans->state) == TRANS_STATE_BLOCKED) {
> -   if (throttle) {
> -   /*
> -* We may race with somebody else here so end up 
> having
> -* to call end_transaction on ourselves again, so inc
> -* our use_count.
> -*/
> -   trans->use_count++;
> +   if (throttle)
> return btrfs_commit_transaction(trans, root);
> -   } else {
> +   else
> wake_up_process(info->transaction_kthread);
> -   }
> }
>
> if (trans->type & __TRANS_FREEZABLE)
> --
> 1.8.3.1
>


Re: [PATCH] Btrfs: abort the transaction when we don't find our extent ref

2014-05-03 Thread Alex Lyakas
Hi Josef,
how about aborting the transaction also in the place where you print:
"umm, got %d back from search, was looking for %llu"?
You abort in case ret < 0, but otherwise the code just proceeds with
extent_slot = path->slots[0];
which can't be right in that case.
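I.e., something along these lines (a sketch paraphrasing the surrounding
__btrfs_free_extent() code, not an exact diff; the errno choice is mine):

	if (ret > 0) {
		btrfs_err(info,
			  "umm, got %d back from search, was looking for %llu",
			  ret, bytenr);
		btrfs_print_leaf(extent_root, path->nodes[0]);
		btrfs_abort_transaction(trans, extent_root, -EIO);
		ret = -EIO;
		goto out;
	}
	extent_slot = path->slots[0];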

Thanks,
Alex.

On Mon, Mar 17, 2014 at 3:55 PM, David Sterba  wrote:
> On Fri, Mar 14, 2014 at 04:36:53PM -0400, Josef Bacik wrote:
>> I'm not sure why we weren't aborting here in the first place, it is 
>> obviously a
>> bad time from the fact that we print the leaf and yell loudly about it.  Fix
>> this up, otherwise we panic because our path could be pointing into oblivion.
>> Thanks,
>>
>> Signed-off-by: Josef Bacik 
>> ---
>>  fs/btrfs/extent-tree.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 696f0b6..0015b02 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5744,6 +5744,8 @@ static int __btrfs_free_extent(struct 
>> btrfs_trans_handle *trans,
>
> Adding context:
>
> 5748 } else if (WARN_ON(ret == -ENOENT)) {
> 5749 btrfs_print_leaf(extent_root, path->nodes[0]);
> 5750 btrfs_err(info,
>
>>   "unable to find ref byte nr %llu parent %llu root %llu 
>>  owner %llu offset %llu",
>>   bytenr, parent, root_objectid, owner_objectid,
>>   owner_offset);
>> + btrfs_abort_transaction(trans, extent_root, ret);
>
> Abort prints stacktrace on it's own and with the WARN_ON above it would
> be noisy and without any extra benefit, so I suggest to remove it.
>
>> + goto out;
>>   } else {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: correctly determine if blocks are shared in btrfs_compare_trees

2014-04-05 Thread Alex Lyakas
Hi Filipe,
Can you please explain in more detail the scenario you are worried about?

Let's say we have two FS trees (subvolumes) subv1 and subv2, subv2
being a RO snapshot of subv1. And they have a shared subtree at
logical==X. Now we change subv1, so its subtree is COW'ed and some
other logical address (Y) is being allocated for subtree root. But X
still cannot be reused as long as subv2 exists. That's the essence of
the extent tree providing refcount for each tree/data block in the FS,
no?

Now finally we delete subv2 and block X is freed. So it can be
reallocated as a root of another subtree. And then it might be
snapshotted again and shared as before.
So where do you see a problem?

If we have two FS-tree subtrees starting at the same logical=X, how
can they be different? This means we allocated logical=X again, while
it was still in use, which is very very bad.

Am I missing something here?

Thanks,
Alex.

P.S.: by "logical" I (and hopefully you) mean the extent-tree level
addresses, i.e., if we have a tree block with logical=X, then we also
have an EXTENT_ITEM with key (X, EXTENT_ITEM, nodesize/leafsize).


On Fri, Feb 21, 2014 at 12:15 AM, Filipe David Borba Manana
 wrote:
> Just comparing the pointers (logical disk addresses) of the btree nodes is
> not completely bullet proof, we have to check if their generation numbers
> match too.
>
> It is guaranteed that a COW operation will result in a block with a different
> logical disk address than the original block's address, but over time we can
> reuse that former logical disk address.
>
> For example, creating a 2Gb filesystem on a loop device, and having a script
> running in a loop always updating the access timestamp of a file, resulted in
> the same logical disk address being reused for the same fs btree block in 
> about
> only 4 minutes.
>
> This could make us skip entire subtrees when doing an incremental send (which
> is currently the only user of btrfs_compare_trees). However the odds of 
> getting
> 2 blocks at the same tree level, with the same logical disk address, equal 
> first
> slot keys and different generations, should hopefully be very low.
>
> Signed-off-by: Filipe David Borba Manana 
> ---
>  fs/btrfs/ctree.c |   11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index cbd3a7d..88d1b1e 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -5376,6 +5376,8 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
> int advance_right;
> u64 left_blockptr;
> u64 right_blockptr;
> +   u64 left_gen;
> +   u64 right_gen;
> u64 left_start_ctransid;
> u64 right_start_ctransid;
> u64 ctransid;
> @@ -5640,7 +5642,14 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
> right_blockptr = btrfs_node_blockptr(
> 
> right_path->nodes[right_level],
> 
> right_path->slots[right_level]);
> -   if (left_blockptr == right_blockptr) {
> +   left_gen = btrfs_node_ptr_generation(
> +   left_path->nodes[left_level],
> +   left_path->slots[left_level]);
> +   right_gen = btrfs_node_ptr_generation(
> +   
> right_path->nodes[right_level],
> +   
> right_path->slots[right_level]);
> +   if (left_blockptr == right_blockptr &&
> +   left_gen == right_gen) {
> /*
>  * As we're on a shared block, don't
>  * allow to go deeper.
> --
> 1.7.9.5
>


Re: [PATCH] Btrfs: attach delayed ref updates to delayed ref heads

2014-03-30 Thread Alex Lyakas
Hi Josef,
I have a question about update_existing_head_ref() logic. The question
is not specific to the rework that you have done.

You have a code like this:
if (ref->must_insert_reserved) {
	/* if the extent was freed and then
	 * reallocated before the delayed ref
	 * entries were processed, we can end up
	 * with an existing head ref without
	 * the must_insert_reserved flag set.
	 * Set it again here
	 */
	existing_ref->must_insert_reserved = ref->must_insert_reserved;
	/*
	 * update the num_bytes so we make sure the accounting
	 * is done correctly
	 */
	existing->num_bytes = update->num_bytes;
}

How can it happen that you have a delayed_ref head for a particular
bytenr, and then somebody wants to add a ref head for the same bytenr
with must_insert_reserved=true? How could he have possibly allocated
the same bytenr from the free-space cache?
I know that when an extent is freed by __btrfs_free_extent(), it calls
update_block_group(), which pins down the extent. So this extent will
be dropped into free-space-cache only on transaction commit, when all
delayed refs have been processed already.
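
As a rough sketch of that lifecycle (this is my reading of the 3.x
sources; take the exact call chain as approximate):

    /* delayed-ref processing, last reference dropped: */
    __btrfs_free_extent()
      -> update_block_group(trans, root, bytenr, num_bytes, 0 /* free */)
         /* the extent is marked pinned -- NOT returned to the
          * free-space cache yet */

    /* only at transaction commit: */
    btrfs_finish_extent_commit()
      -> unpin_extent_range()
         -> btrfs_add_free_space()   /* now it is allocatable again */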

The only close case that I see is in btrfs_free_tree_block(), where it
adds a BTRFS_DROP_DELAYED_REF, and then if check_ref_cleanup()==0 and
BTRFS_HEADER_FLAG_WRITTEN is not set, it drops the extent directly
into free-space cache. However, check_ref_cleanup() would have deleted
the ref head, so we would not have found an existing ref head.

Can you please give me a clue on this?

Thanks!
Alex.



On Thu, Jan 23, 2014 at 5:28 PM, Josef Bacik  wrote:
> Currently we have two rb-trees, one for delayed ref heads and one for all of
> the delayed refs, including the delayed ref heads.  When we process the
> delayed refs we have to hold onto the delayed ref lock for all of the
> selecting and merging and such, which results in quite a bit of lock
> contention.  This was solved by having a waitqueue and only one flusher at a
> time, however this hurts if we get a lot of delayed refs queued up.
>
> So instead just have an rb tree for the delayed ref heads, and then attach the
> delayed ref updates to an rb tree that is per delayed ref head.  Then we only
> need to take the delayed ref lock when adding new delayed refs and when
> selecting a delayed ref head to process, all the rest of the time we deal
> with a per delayed ref head lock which will be much less contentious.
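
If I read the patch right, the shape after the change is roughly the
following -- field names taken from the patch and final commit, other
members elided, so treat this as approximate:

    struct btrfs_delayed_ref_root {
            struct rb_root href_root;  /* only the heads live here now */
            spinlock_t lock;           /* add refs / pick a head to run */
    };

    struct btrfs_delayed_ref_head {
            struct rb_node href_node;  /* node in href_root */
            struct rb_root ref_root;   /* this head's own ref updates */
            spinlock_t lock;           /* per-head, far less contended */
            struct mutex mutex;
    };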
>
> The locking rules for this get a little more complicated since we have to lock
> up to 3 things to properly process delayed refs, but I will address that
> problem later.  For now this passes all of xfstests and my overnight stress
> tests.
> Thanks,
>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/backref.c |  23 ++--
>  fs/btrfs/delayed-ref.c | 223 +-
>  fs/btrfs/delayed-ref.h |  23 ++--
>  fs/btrfs/disk-io.c |  79 ++--
>  fs/btrfs/extent-tree.c | 317 -
>  fs/btrfs/transaction.c |   7 +-
>  6 files changed, 267 insertions(+), 405 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 835b6c9..34a8952 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -538,14 +538,13 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
> if (extent_op && extent_op->update_key)
> btrfs_disk_key_to_cpu(&op_key, &extent_op->key);
>
> -   while ((n = rb_prev(n))) {
> +   spin_lock(&head->lock);
> +   n = rb_first(&head->ref_root);
> +   while (n) {
> struct btrfs_delayed_ref_node *node;
> node = rb_entry(n, struct btrfs_delayed_ref_node,
> rb_node);
> -   if (node->bytenr != head->node.bytenr)
> -   break;
> -   WARN_ON(node->is_head);
> -
> +   n = rb_next(n);
> if (node->seq > seq)
> continue;
>
> @@ -612,10 +611,10 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
> WARN_ON(1);
> }
> if (ret)
> -   return ret;
> +   break;
> }
> -
> -   return 0;
> +   spin_unlock(&head->lock);
> +   return ret;
>  }
>
>  /*
> @@ -882,15 +881,15 @@ again:
> btrfs_put_delayed_ref(&head->node);
> goto again;
> }
> +   spin_unlock(&delayed_refs->lock);
> ret = __add_delayed_refs(head, time_seq,
>  &prefs_delayed);
> mutex_unlock(&head->mutex);
> -   if (ret) {
> -   spin_unlock(&delayed_refs->lock);
> +   if (ret)

Re: [PATCH] Btrfs: fix memory leak in btrfs_create_tree()

2014-03-27 Thread Alex Lyakas
Hi Tsutomu Itoh,

On Thu, Mar 21, 2013 at 6:32 AM, Tsutomu Itoh  wrote:
> We should free leaf and root before returning from the error
> handling code.
>
> Signed-off-by: Tsutomu Itoh 
> ---
>  fs/btrfs/disk-io.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7d84651..b1b5baa 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1291,6 +1291,7 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
>   0, objectid, NULL, 0, 0, 0);
> if (IS_ERR(leaf)) {
> ret = PTR_ERR(leaf);
> +   leaf = NULL;
> goto fail;
> }
>
> @@ -1334,11 +1335,16 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
>
> btrfs_tree_unlock(leaf);
>
> +   return root;
> +
>  fail:
> -   if (ret)
> -   return ERR_PTR(ret);
> +   if (leaf) {
> +   btrfs_tree_unlock(leaf);
> +   free_extent_buffer(leaf);
I believe this is not enough. A few lines above, another reference on
the root is taken by
root->commit_root = btrfs_root_node(root);

So I believe the proper fix would be:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d9698fd..260af79 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1354,10 +1354,10 @@ struct btrfs_root *btrfs_create_tree(struct
btrfs_trans_handle *trans,
return root;

 fail:
-   if (leaf) {
+   if (leaf)
btrfs_tree_unlock(leaf);
-   free_extent_buffer(leaf);
-   }
+   free_extent_buffer(root->node);
+   free_extent_buffer(root->commit_root);
kfree(root);

return ERR_PTR(ret);



Thanks,
Alex.



> +   }
> +   kfree(root);
>
> -   return root;
> +   return ERR_PTR(ret);
>  }
>
>  static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
>


Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-08 Thread Alex Lyakas
Thanks, Miao,
so the problem is that cow_file_range() joins transaction, allocates
space through btrfs_reserve_extent(), then detaches from transaction.
And then btrfs_finish_ordered_io() joins transaction again, adds a
delayed ref and detaches from transaction. So if between these two,
the transaction commits and we crash, then yes, the allocation is
lost.
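
Restating the race as a timeline (my reading; simplified):

    /* cow_file_range():
     *   join transaction
     *   btrfs_reserve_extent()   -> space leaves the free-space cache
     *   detach from transaction  -> no delayed ref exists yet
     *
     * ... data write in flight, transaction commits and writes out
     *     the free space cache, then the system crashes ...
     *
     * btrfs_finish_ordered_io() (never runs):
     *   join transaction
     *   insert file extent item + delayed ref -> extent tree update
     *
     * Result: the allocated range is recorded nowhere.
     */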

Alex.


On Tue, Mar 4, 2014 at 8:04 AM, Miao Xie  wrote:
> On Sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote:
>> Hi Miao,
>>
>> On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie  wrote:
>>> When we mounted the filesystem after the crash, we got the following
>>> message:
>>>   BTRFS error (device xxx): block group 4315938816 has wrong amount of free space
>>>   BTRFS error (device xxx): failed to load free space cache for block group 4315938816
>>>
>>> It is because we didn't update the metadata of the allocated space until
>>> the file data was written into the disk. During this time, there was no
>>> information about the allocated spaces in either the extent tree or the
>>> free space cache. When we wrote out the free space cache at this time, those
>>> spaces were lost.
>> Can you please clarify more about the problem?
>> So I understand that we allocate something from the free space cache.
>> So after that, the free space cache does not account for this extent
>> anymore. On the other hand the extent tree also does not account for
>> it (yet). We need to add a delayed reference, which will be run at
>> transaction commit and update the extent tree. But free-space cache is
>> also written out during transaction commit. So how the issue happens?
>> Can you perhaps post a flow of events?
>
> Task1   Task2
> btrfs_writepages()
>   alloc data space
> (The allocated space was
>  removed from the free
>  space cache)
>   submit_bio()
> btrfs_commit_transaction()
>   write out space cache
>   ...
>   commit transaction completed
> system crash
>  (end_io() wasn't executed)
>
> The system crashed before end_io was executed, so no file references to the
> allocated space, and the extent tree also does not account for it. That space
> was lost.
>
> Thanks
> Miao
>>
>> Thanks.
>> Alex.
>>
>>
>>>
>>> In order to fix this problem, I use a state tree for every block group
>>> to record those allocated spaces. We record the information when they are
>>> allocated, and clean up the information after the metadata update. Besides
>>> that, we also introduce a read-write semaphore to avoid the race between
>>> the allocation and the free space cache write out.
>>>
>>> Only data block groups had this problem, so the above change is just
>>> for data space allocation.
>>>
>>> Signed-off-by: Miao Xie 
>>> ---
>>>  fs/btrfs/ctree.h| 15 ++-
>>>  fs/btrfs/disk-io.c  |  2 +-
>>>  fs/btrfs/extent-tree.c  | 24 
>>>  fs/btrfs/free-space-cache.c | 42 ++
>>>  fs/btrfs/inode.c| 42 +++---
>>>  5 files changed, 108 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 1667c9a..f58e1f7 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
>>> /* free space cache stuff */
>>> struct btrfs_free_space_ctl *free_space_ctl;
>>>
>>> +   /*
>>> +* It is used to record the extents that are allocated for
>>> +* the data, but don't update its metadata.
>>> +*/
>>> +   struct extent_io_tree pinned_extents;
>>> +
>>> /* block group cache stuff */
>>> struct rb_node cache_node;
>>>
>>> @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
>>>  */
>>> struct list_head space_info;
>>>
>>> +   /*
>>> +* It is just used for the delayed data space allocation
>>> +* because only the data space allocation can be done while
>>> +* we write out the free space cache.
>>>

Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-01 Thread Alex Lyakas
Hi Miao,

On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie  wrote:
> When we mounted the filesystem after the crash, we got the following
> message:
>   BTRFS error (device xxx): block group 4315938816 has wrong amount of free space
>   BTRFS error (device xxx): failed to load free space cache for block group 4315938816
>
> It is because we didn't update the metadata of the allocated space until
> the file data was written into the disk. During this time, there was no
> information about the allocated spaces in either the extent tree or the
> free space cache. When we wrote out the free space cache at this time, those
> spaces were lost.
Can you please clarify more about the problem?
So I understand that we allocate something from the free space cache.
So after that, the free space cache does not account for this extent
anymore. On the other hand the extent tree also does not account for
it (yet). We need to add a delayed reference, which will be run at
transaction commit and update the extent tree. But free-space cache is
also written out during transaction commit. So how the issue happens?
Can you perhaps post a flow of events?

Thanks.
Alex.


>
> In order to fix this problem, I use a state tree for every block group
> to record those allocated spaces. We record the information when they are
> allocated, and clean up the information after the metadata update. Besides
> that, we also introduce a read-write semaphore to avoid the race between
> the allocation and the free space cache write out.
>
> Only data block groups had this problem, so the above change is just
> for data space allocation.
>
> Signed-off-by: Miao Xie 
> ---
>  fs/btrfs/ctree.h| 15 ++-
>  fs/btrfs/disk-io.c  |  2 +-
>  fs/btrfs/extent-tree.c  | 24 
>  fs/btrfs/free-space-cache.c | 42 ++
>  fs/btrfs/inode.c| 42 +++---
>  5 files changed, 108 insertions(+), 17 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1667c9a..f58e1f7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
> /* free space cache stuff */
> struct btrfs_free_space_ctl *free_space_ctl;
>
> +   /*
> +* It is used to record the extents that are allocated for
> +* the data, but don't update its metadata.
> +*/
> +   struct extent_io_tree pinned_extents;
> +
> /* block group cache stuff */
> struct rb_node cache_node;
>
> @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
>  */
> struct list_head space_info;
>
> +   /*
> +* It is just used for the delayed data space allocation
> +* because only the data space allocation can be done while
> +* we write out the free space cache.
> +*/
> +   struct rw_semaphore data_rwsem;
> +
> struct btrfs_space_info *data_sinfo;
>
> struct reloc_control *reloc_ctl;
> @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
>struct btrfs_key *ins);
>  int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
>  u64 min_alloc_size, u64 empty_size, u64 hint_byte,
> -struct btrfs_key *ins, int is_data);
> +struct btrfs_key *ins, int is_data, bool need_pin);
>  int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>   struct extent_buffer *buf, int full_backref, int for_cow);
>  int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8072cfa..426b558 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
> fs_info->pinned_extents = &fs_info->freed_extents[0];
> fs_info->do_barriers = 1;
>
> -
> mutex_init(&fs_info->ordered_operations_mutex);
> mutex_init(&fs_info->ordered_extent_flush_mutex);
> mutex_init(&fs_info->tree_log_mutex);
> @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb,
> init_rwsem(&fs_info->extent_commit_sem);
> init_rwsem(&fs_info->cleanup_work_sem);
> init_rwsem(&fs_info->subvol_sem);
> +   init_rwsem(&fs_info->data_rwsem);
> sema_init(&fs_info->uuid_tree_rescan_sem, 1);
> fs_info->dev_replace.lock_owner = 0;
> atomic_set(&fs_info->dev_replace.nesting_level, 0);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3664cfb..7b07876 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6173,7 +6173,7 @@ enum btrfs_loop_type {
>  static noinline int find_free_extent(struct btrfs_root *orig_root,
>  u64 num_bytes, u64 empty_size,
>   

Re: [PATCH] Btrfs: fix a deadlock on chunk mutex

2014-02-18 Thread Alex Lyakas
Hi Josef,
is this the commit to look at:
6df9a95e63395f595d0d1eb5d561dd6c91c40270 Btrfs: make the chunk
allocator completely tree lockless

or some other commits are also relevant?

Alex.


On Tue, Feb 18, 2014 at 6:06 PM, Josef Bacik  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
>
>
> On 02/18/2014 10:47 AM, Alex Lyakas wrote:
>> Hello Josef,
>>
>> On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik 
>> wrote:
>>> On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
>>>> A user reported that he has hit an annoying deadlock while
>>>> playing with ceph based on btrfs.
>>>>
>>>> Current updating device tree requires space from METADATA
>>>> chunk, so we -may- need to do a recursive chunk allocation when
>>>> adding/updating dev extent, that is where the deadlock comes
>>>> from.
>>>>
>>>> If we use SYSTEM metadata to update device tree, we can avoid
>>>> the recursive stuff.
>>>>
>>>
>>> This is going to cause us to allocate much more system chunks
>>> than we used to which could land us in trouble.  Instead let's
>>> just keep us from re-entering if we're already allocating a
>>> chunk.  We do the chunk allocation when we don't have enough
>>> space for a cluster, but we'll likely have plenty of space to
>>> make an allocation.  Can you give this patch a try Jim and see if
>>> it fixes your problem? Thanks,
>>>
>>> Josef
>>>
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index e152809..59df5e7 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
>>> int wait_for_alloc = 0;
>>> int ret = 0;
>>>
>>> +   /* Don't re-enter if we're already allocating a chunk */
>>> +   if (trans->allocating_chunk)
>>> +   return -ENOSPC;
>>> +
>>> space_info = __find_space_info(extent_root->fs_info, flags);
>>> if (!space_info) {
>>> ret = update_space_info(extent_root->fs_info, flags,
>>> @@ -3606,6 +3610,8 @@ again:
>>> goto again;
>>> }
>>>
>>> +   trans->allocating_chunk = true;
>>> +
>>> /*
>>>  * If we have mixed data/metadata chunks we want to make sure we keep
>>>  * allocating mixed chunks instead of individual chunks.
>>> @@ -3632,6 +3638,7 @@ again:
>>> check_system_chunk(trans, extent_root, flags);
>>>
>>> ret = btrfs_alloc_chunk(trans, extent_root, flags);
>>> +   trans->allocating_chunk = false;
>>> if (ret < 0 && ret != -ENOSPC)
>>> goto out;
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index e6509b9..47ad8be 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -388,6 +388,7 @@ again:
>>> h->qgroup_reserved = qgroup_reserved;
>>> h->delayed_ref_elem.seq = 0;
>>> h->type = type;
>>> +   h->allocating_chunk = false;
>>> INIT_LIST_HEAD(&h->qgroup_ref_list);
>>> INIT_LIST_HEAD(&h->new_bgs);
>>>
>>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>>> index 0e8aa1e..69700f7 100644
>>> --- a/fs/btrfs/transaction.h
>>> +++ b/fs/btrfs/transaction.h
>>> @@ -68,6 +68,7 @@ struct btrfs_trans_handle {
>>> struct btrfs_block_rsv *orig_rsv;
>>> short aborted;
>>> short adding_csums;
>>> +   bool allocating_chunk;
>>> enum btrfs_trans_type type;
>>> /*
>>>  * this root is only needed to validate that the root passed to
>>
>> I hit this problem in a following scenario:
>> - a data chunk allocation is triggered, and locks chunk_mutex
>> - the same thread now also wants to allocate a metadata chunk, so it
>> recursively calls do_chunk_alloc, but cannot lock the chunk_mutex =>
>> deadlock
>> - btrfs has only one metadata chunk, the one that was initially
>> allocated by mkfs, it has:
>> total_bytes=8388608
>> bytes_used=8130560
>> bytes_pinned=77824
>> bytes_reserved=180224
>> so bytes_used + bytes_pinned + bytes_reserved == total_bytes
>>
>> Your patch would have returned ENOSPC and avoided the deadlock, but
>> there would be a failure to allocate a tree block for metadata. So
>> the transaction would have probably aborted.
>>
>> How should such a situation be handled?
>>
>> Idea1:
>> - lock chunk mutex,
>> - if we are allocating a data chunk, check whether the metadata space
>> is below some threshold. If yes, go and allocate a metadata chunk
>> first and then only a data

Re: [PATCH] Btrfs: fix a deadlock on chunk mutex

2014-02-18 Thread Alex Lyakas
Hello Josef,

On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik  wrote:
> On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
>> A user reported that he has hit an annoying deadlock while playing with
>> ceph based on btrfs.
>>
>> Current updating device tree requires space from METADATA chunk,
>> so we -may- need to do a recursive chunk allocation when adding/updating
>> dev extent, that is where the deadlock comes from.
>>
>> If we use SYSTEM metadata to update device tree, we can avoid the recursive
>> stuff.
>>
>
> This is going to cause us to allocate much more system chunks than we used to
> which could land us in trouble.  Instead let's just keep us from re-entering
> if we're already allocating a chunk.  We do the chunk allocation when we
> don't have enough space for a cluster, but we'll likely have plenty of space
> to make an allocation.  Can you give this patch a try Jim and see if it fixes
> your problem?
> Thanks,
>
> Josef
>
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e152809..59df5e7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
> int wait_for_alloc = 0;
> int ret = 0;
>
> +   /* Don't re-enter if we're already allocating a chunk */
> +   if (trans->allocating_chunk)
> +   return -ENOSPC;
> +
> space_info = __find_space_info(extent_root->fs_info, flags);
> if (!space_info) {
> ret = update_space_info(extent_root->fs_info, flags,
> @@ -3606,6 +3610,8 @@ again:
> goto again;
> }
>
> +   trans->allocating_chunk = true;
> +
> /*
>  * If we have mixed data/metadata chunks we want to make sure we keep
>  * allocating mixed chunks instead of individual chunks.
> @@ -3632,6 +3638,7 @@ again:
> check_system_chunk(trans, extent_root, flags);
>
> ret = btrfs_alloc_chunk(trans, extent_root, flags);
> +   trans->allocating_chunk = false;
> if (ret < 0 && ret != -ENOSPC)
> goto out;
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index e6509b9..47ad8be 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -388,6 +388,7 @@ again:
> h->qgroup_reserved = qgroup_reserved;
> h->delayed_ref_elem.seq = 0;
> h->type = type;
> +   h->allocating_chunk = false;
> INIT_LIST_HEAD(&h->qgroup_ref_list);
> INIT_LIST_HEAD(&h->new_bgs);
>
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0e8aa1e..69700f7 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -68,6 +68,7 @@ struct btrfs_trans_handle {
> struct btrfs_block_rsv *orig_rsv;
> short aborted;
> short adding_csums;
> +   bool allocating_chunk;
> enum btrfs_trans_type type;
> /*
>  * this root is only needed to validate that the root passed to

I hit this problem in a following scenario:
- a data chunk allocation is triggered, and locks chunk_mutex
- the same thread now also wants to allocate a metadata chunk, so it
recursively calls do_chunk_alloc, but cannot lock the chunk_mutex =>
deadlock
- btrfs has only one metadata chunk, the one that was initially
allocated by mkfs, it has:
total_bytes=8388608
bytes_used=8130560
bytes_pinned=77824
bytes_reserved=180224
so bytes_used + bytes_pinned + bytes_reserved == total_bytes

Your patch would have returned ENOSPC and avoided the deadlock, but
there would be a failure to allocate a tree block for metadata. So the
transaction would have probably aborted.

How should such a situation be handled?

Idea1:
- lock chunk mutex,
- if we are allocating a data chunk, check whether the metadata space
is below some threshold. If yes, go and allocate a metadata chunk
first and then only a data chunk.

Idea2:
- check if we are the same thread that already locked the chunk mutex.
If yes, allow a recursive call but don't attempt to lock/unlock the
chunk_mutex this time (a rough sketch of this idea follows below).

Or some other way?
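
For Idea2, a minimal sketch of what I mean (chunk_owner/chunk_depth are
hypothetical fields that do not exist in btrfs_fs_info today):

    static void chunk_mutex_lock(struct btrfs_fs_info *fs_info)
    {
            if (fs_info->chunk_owner != current) {
                    mutex_lock(&fs_info->chunk_mutex);
                    fs_info->chunk_owner = current;
            }
            fs_info->chunk_depth++;  /* recursive entry by the owner */
    }

    static void chunk_mutex_unlock(struct btrfs_fs_info *fs_info)
    {
            if (--fs_info->chunk_depth == 0) {
                    fs_info->chunk_owner = NULL;
                    mutex_unlock(&fs_info->chunk_mutex);
            }
    }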

Thanks!
Alex.








Re: [PATCH] btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl

2014-02-15 Thread Alex Lyakas
Hello Hugo,

Is this issue specific to the receive ioctl?

Because what you are describing applies to any user-kernel ioctl-based
interface, when you compile the user-space as 32-bit while the kernel
has been compiled as 64-bit. For that purpose, I believe, there
exists the "compat_ioctl" callback. Its implementation should do
"thunking", i.e., treat the user-space structure as if it were
compiled for 32-bit, and unpack it properly.

Today, however, btrfs supplies the same callback both for
"unlocked_ioctl" and "compat_ioctl". So we should see the same problem
with all ioctls, if I am not missing anything.
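
For reference, the size difference here comes from the embedded timespec
fields; a quick userspace check (ordinary C, nothing btrfs-specific):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
            /* build once with -m32 and once with -m64: the size differs,
             * so any ioctl number computed from a struct embedding a
             * timespec differs between the two ABIs as well */
            printf("sizeof(struct timespec) = %zu\n",
                   sizeof(struct timespec));
            return 0;
    }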

Thanks,
Alex.



On Mon, Jan 6, 2014 at 10:50 AM, Hugo Mills  wrote:
> On Sun, Jan 05, 2014 at 06:26:11PM +, Hugo Mills wrote:
>> On Sun, Jan 05, 2014 at 05:55:27PM +, Hugo Mills wrote:
>> > The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
>> > and 64-bit systems. This means that it is impossible to use btrfs
>> > receive on a system with a 64-bit kernel and 32-bit userspace, because
>> > the structure size (and hence the ioctl number) is different.
>> >
>> > This patch adds a compatibility structure and ioctl to deal with the
>> > above case.
>>
>>Oops, forgot to mention -- this has been compile tested, but not
>> actually run yet. The machine in question is several miles away and is
>> a production machine (it's my work desktop, and I can't afford much
>> downtime on it).
>
>... And it doesn't even compile properly, now I come to build a
> .deb. I'm still interested in comments about the general approach, but
> the specific patch is a load of balls.
>
>Hugo.
>
>>Hugo.
>>
>> > Signed-off-by: Hugo Mills 
>> > ---
>> >  fs/btrfs/ioctl.c | 95 +++-
>> >  1 file changed, 87 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> > index 21da576..e186439 100644
>> > --- a/fs/btrfs/ioctl.c
>> > +++ b/fs/btrfs/ioctl.c
>> > @@ -57,6 +57,32 @@
>> >  #include "send.h"
>> >  #include "dev-replace.h"
>> >
>> > +#ifdef CONFIG_64BIT
>> > +/* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
>> > + * structures are incorrect, as the timespec structure from userspace
>> > + * is 4 bytes too small. We define these alternatives here to teach
>> > + * the kernel about the 32-bit struct packing.
>> > + */
>> > +struct btrfs_ioctl_timespec {
>> > +   __u64 sec;
>> > +   __u32 nsec;
>> > +} ((__packed__));
>> > +
>> > +struct btrfs_ioctl_received_subvol_args {
>> > +   charuuid[BTRFS_UUID_SIZE];  /* in */
>> > +   __u64   stransid;   /* in */
>> > +   __u64   rtransid;   /* out */
>> > +   struct btrfs_ioctl_timespec stime; /* in */
>> > +   struct btrfs_ioctl_timespec rtime; /* out */
>> > +   __u64   flags;  /* in */
>> > +   __u64   reserved[16];   /* in */
>> > +} ((__packed__));
>> > +#endif
>> > +
>> > +#define BTRFS_IOC_SET_RECEIVED_SUBVOL_32 _IOWR(BTRFS_IOCTL_MAGIC, 37, \
>> > +   struct btrfs_ioctl_received_subvol_args_32)
>> > +
>> > +
>> >  static int btrfs_clone(struct inode *src, struct inode *inode,
>> >u64 off, u64 olen, u64 olen_aligned, u64 destoff);
>> >
>> > @@ -4313,10 +4339,69 @@ static long btrfs_ioctl_quota_rescan_wait(struct file *file, void __user *arg)
>> > return btrfs_qgroup_wait_for_completion(root->fs_info);
>> >  }
>> >
>> > +#ifdef CONFIG_64BIT
>> > +static long btrfs_ioctl_set_received_subvol_32(struct file *file,
>> > +   void __user *arg)
>> > +{
>> > +   struct btrfs_ioctl_received_subvol_args_32 *args32 = NULL;
>> > +   struct btrfs_ioctl_received_subvol_args *args64 = NULL;
>> > +   int ret = 0;
>> > +
>> > +   args32 = memdup_user(arg, sizeof(*args32));
>> > +   if (IS_ERR(args32)) {
>> > +   ret = PTR_ERR(args32);
>> > +   args32 = NULL;
>> > +   goto out;
>> > +   }
>> > +
>> > +   args64 = malloc(sizeof(*args64));
>> > +   if (IS_ERR(args64)) {
>> > +   ret = PTR_ERR(args64);
>> > +   args64 = NULL;
>> > +   goto out;
>> > +   }
>> > +
>> > +   memcpy(args64->uuid, args32->uuid, BTRFS_UUID_SIZE);
>> > +   args64->stransid = args32->stransid;
>> > +   args64->rtransid = args32->rtransid;
>> > +   args64->stime.sec = args32->stime.sec;
>> > +   args64->stime.nsec = args32->stime.nsec;
>> > +   args64->rtime.sec = args32->rtime.sec;
>> > +   args64->rtime.nsec = args32->rtime.nsec;
>> > +   args64->flags = args32->flags;
>> > +
>> > +   ret = _btrfs_ioctl_set_received_subvol(file, args64);
>> > +
>> > +out:
>> > +   kfree(args32);
>> > +   kfree(args64);
>> > +   return ret;
>> > +}
>> > +#endif
>> > +
>> >  static long btrfs_ioctl_set_received_subvol(struct file *file,
>> > void __user *arg)
>> >  {
>> > struct btrfs_ioctl_received_subvol_args *sa = NULL;
>> > +   int ret = 0;
>> > +
>> > +   s

Re: [PATCH] Btrfs: return ENOSPC when target space is full

2014-01-19 Thread Alex Lyakas
Hi Filipe,
I think in the context of do_chunk_alloc(), 0 doesn't mean "success".
0 means "allocation was not attempted". While 1 means "allocation was
attempted and succeeded". -ENOSPC means "allocation was attempted but
failed". Any other errno deserves transaction abort.
Anyways, the callers are ok with "0" and -ENOSPC and re-search for a
free extent in these cases.
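
A sketch of the caller pattern I mean (paraphrasing the re-search loop
around do_chunk_alloc(); not the literal code):

    ret = do_chunk_alloc(trans, root, flags, CHUNK_ALLOC_FORCE);
    if (ret < 0 && ret != -ENOSPC) {
            /* a real error: only this case aborts the transaction */
            btrfs_abort_transaction(trans, root, ret);
            goto out;
    }
    /* ret == 0 ("not attempted") or ret == -ENOSPC ("attempted but
     * failed"): either way, go back and re-search for a free extent */
    goto search;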

Alex.


On Mon, Aug 5, 2013 at 5:25 PM, Filipe David Borba Manana
 wrote:
> In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success)
> when the target space was full and when chunk allocation is needed.
> However, later on in that same function we return ENOSPC if
> btrfs_alloc_chunk() fails (and chunk allocation was needed) and
> set the space's full flag.
>
> This was inconsistent, as -ENOSPC should be returned if the space
> is full and a chunk allocation needs to performed. If the space is
> full but no chunk allocation is needed, just return 0 (success).
>
> Signed-off-by: Filipe David Borba Manana 
> ---
>  fs/btrfs/extent-tree.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e868c35..ef89a66 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3829,8 +3829,12 @@ again:
> if (force < space_info->force_alloc)
> force = space_info->force_alloc;
> if (space_info->full) {
> +   if (should_alloc_chunk(extent_root, space_info, force))
> +   ret = -ENOSPC;
> +   else
> +   ret = 0;
> spin_unlock(&space_info->lock);
> -   return 0;
> +   return ret;
> }
>
> if (!should_alloc_chunk(extent_root, space_info, force)) {
> --
> 1.7.9.5
>


Re: [PATCH v2] Btrfs-progs: avoid using btrfs internal subvolume path to send

2014-01-11 Thread Alex Lyakas
Hi Miguel,

On Sat, Nov 30, 2013 at 1:43 PM, Miguel Negrão
 wrote:
> Em 29-11-2013 16:37, Wang Shilong escreveu:
>> From: Wang Shilong 
>>
>> Steps to reproduce:
>>   # mkfs.btrfs -f /dev/sda
>>   # mount /dev/sda /mnt
>>   # btrfs subvolume create /mnt/foo
>>   # umount /mnt
>>   # mount -o subvol=foo /dev/sda /mnt
>>   # btrfs sub snapshot -r /mnt /mnt/snap
>>   # btrfs send /mnt/snap > /dev/null
>>
>> We will fail to send '/mnt/snap'; this is because btrfs send tries to
>> open '/mnt/snap' by the btrfs internal subvolume path 'foo/snap' rather
>> than the relative path based on the mount point, so it returns 'no
>> such file or directory'. This is not right; fix it.
>
> I was going to write to the list to report exactly this issue. In my
> case, this happens when the default subvolume has been changed from id 5
> to some other id. I get the error 'no such file or directory'. Currently
> my workaround is to mount the root subvolume with -o subvolid=5 and then
> do the send.
>
> Also, I'd like to ask, are there plans to make the send and receive
> commands resumeable somehow (or perhaps it is already, but couldn't see
> how) ?
I have proposed a patch to address the resumability of send-receive
some time ago in this thread:
http://www.spinics.net/lists/linux-btrfs/msg18180.html

However, this changes the current user-kernel protocol used by "send",
and overall a big change, which is not easy to integrate.

Alex.


>
> best,
> Miguel Negrão
>
>
>


Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quata_enable is 0

2013-12-03 Thread Alex Lyakas
And the same is also true for "struct tree_mod_elem" sitting in
"fs_info->tree_mod_log". Usually these are also freed up by
btrfs_delayed_refs_qgroup_accounting() calling
btrfs_put_tree_mod_seq(). But not in case of transaction abort.
Some sample kmemleak stacks below:

unreferenced object 0x8800365605a0 (size 128):
  comm "btrfs", pid 21162, jiffies 4295508540 (age 516.080s)
  hex dump (first 32 bytes):
81 04 56 36 00 88 ff ff 00 00 00 00 00 00 00 00  ..V6
00 00 00 00 00 00 00 00 31 05 00 00 00 00 00 00  1...
  backtrace:
[] kmemleak_alloc+0x26/0x50
[] kmem_cache_alloc_trace+0xab/0x160
[] tree_mod_log_insert_key_locked+0x4f/0x140 [btrfs]
[] tree_mod_log_free_eb+0xc2/0xf0 [btrfs]
[] __btrfs_cow_block+0x316/0x520 [btrfs]
[] btrfs_cow_block+0x12e/0x1f0 [btrfs]
[] btrfs_search_slot+0x381/0x830 [btrfs]
[] btrfs_insert_empty_items+0x7c/0x110 [btrfs]
[] insert_with_overflow+0x43/0x170 [btrfs]
[] btrfs_insert_dir_item+0xbf/0x200 [btrfs]
[] create_pending_snapshot+0xbf5/0xd70 [btrfs]
[] create_pending_snapshots+0x169/0x240 [btrfs]
[] btrfs_commit_transaction+0x4aa/0x1080 [btrfs]
[] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs]
[] btrfs_ioctl_snap_create_transid+0x10f/0x1b0 [btrfs]
[] btrfs_ioctl_snap_create_v2+0x135/0x190 [btrfs]

unreferenced object 0x88007ab00c60 (size 128):
  comm "btrfs", pid 21162, jiffies 4295508540 (age 516.760s)
  hex dump (first 32 bytes):
10 0e b0 7a 00 88 ff ff 00 00 00 00 00 00 00 00  ...z
00 00 00 00 00 00 00 00 b7 05 00 00 00 00 00 00  
  backtrace:
[] kmemleak_alloc+0x26/0x50
[] kmem_cache_alloc_trace+0xab/0x160
[] tree_mod_log_insert_key_mask.isra.33+0xb4/0x1c0 [btrfs]
[] tree_mod_log_insert_key+0xe/0x10 [btrfs]
[] __btrfs_cow_block+0x2bf/0x520 [btrfs]
[] btrfs_cow_block+0x12e/0x1f0 [btrfs]
[] push_leaf_right+0x133/0x1a0 [btrfs]
[] split_leaf+0x5e1/0x770 [btrfs]
[] btrfs_search_slot+0x780/0x830 [btrfs]
[] btrfs_insert_empty_items+0x7c/0x110 [btrfs]
[] insert_with_overflow+0x43/0x170 [btrfs]
[] btrfs_insert_dir_item+0xbf/0x200 [btrfs]
[] create_pending_snapshot+0xbf5/0xd70 [btrfs]
[] create_pending_snapshots+0x169/0x240 [btrfs]
[] btrfs_commit_transaction+0x4aa/0x1080 [btrfs]
[] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs]


unreferenced object 0x88007b024780 (size 32):
  comm "btrfs", pid 26945, jiffies 4295316905 (age 767.060s)
  hex dump (first 32 bytes):
20 47 02 7b 00 88 ff ff 00 4f 02 7b 00 88 ff ff   G.{.O.{
90 86 b6 6a 00 88 ff ff 00 00 00 00 00 00 00 00  ...j
  backtrace:
[] kmemleak_alloc+0x26/0x50
[] kmem_cache_alloc_trace+0xab/0x160
[] btrfs_qgroup_record_ref+0x44/0xd0 [btrfs]
[] btrfs_add_delayed_tree_ref+0x141/0x1f0 [btrfs]
[] btrfs_free_tree_block+0x9d/0x220 [btrfs]
[] __btrfs_cow_block+0x475/0x520 [btrfs]
[] btrfs_cow_block+0x12e/0x1f0 [btrfs]
[] btrfs_search_slot+0x381/0x830 [btrfs]
[] btrfs_insert_empty_items+0x7c/0x110 [btrfs]
[] insert_with_overflow+0x43/0x170 [btrfs]
[] btrfs_insert_dir_item+0xbf/0x200 [btrfs]
[] create_pending_snapshot+0xbf5/0xd50 [btrfs]
[] create_pending_snapshots+0x101/0x1d0 [btrfs]
[] btrfs_commit_transaction+0x4aa/0x1080 [btrfs]
[] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs]
[] btrfs_ioctl_snap_create_transid+0x10f/0x1b0 [btrfs]

unreferenced object 0x88007b024210 (size 32):
  comm "btrfs", pid 26945, jiffies 4295316907 (age 767.068s)
  hex dump (first 32 bytes):
e0 44 02 7b 00 88 ff ff 60 46 02 7b 00 88 ff ff  .D.{`F.{
50 e1 41 6a 00 88 ff ff 70 68 96 6b 00 88 ff ff  P.Ajph.k
  backtrace:
[] kmemleak_alloc+0x26/0x50
[] kmem_cache_alloc_trace+0xab/0x160
[] btrfs_qgroup_record_ref+0x44/0xd0 [btrfs]
[] btrfs_add_delayed_tree_ref+0x141/0x1f0 [btrfs]
[] btrfs_alloc_free_block+0x1a4/0x450 [btrfs]
[] __btrfs_cow_block+0x138/0x520 [btrfs]
[] btrfs_cow_block+0x12e/0x1f0 [btrfs]
[] btrfs_search_slot+0x381/0x830 [btrfs]
[] btrfs_insert_empty_items+0x7c/0x110 [btrfs]
[] insert_with_overflow+0x43/0x170 [btrfs]
[] btrfs_insert_dir_item+0xbf/0x200 [btrfs]
[] create_pending_snapshot+0xbf5/0xd50 [btrfs]
[] create_pending_snapshots+0x101/0x1d0 [btrfs]
[] btrfs_commit_transaction+0x4aa/0x1080 [btrfs]
[] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs]
[] btrfs_ioctl_snap_create_transid+0x10f/0x1b0 [btrfs]


On Tue, Dec 3, 2013 at 7:13 PM, Alex Lyakas
 wrote:
> Hi Liu, Jan,
>
> What happens to "struct qgroup_update"s sitting in
> trans->qgroup_ref_list in case the transaction aborts? It seems that
> they are not freed.
>
> For example, if we are in btrfs_commit_transaction() and:
> - call create_pending_snapshots()
> - call btrfs_run_delayed_items() and this fails
> So we go to cl

Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quata_enable is 0

2013-12-03 Thread Alex Lyakas
Hi Liu, Jan,

What happens to "struct qgroup_update"s sitting in
trans->qgroup_ref_list in case the transaction aborts? It seems that
they are not freed.

For example, if we are in btrfs_commit_transaction() and:
- call create_pending_snapshots()
- call btrfs_run_delayed_items() and this fails
So we go to cleanup_transaction() without calling
btrfs_delayed_refs_qgroup_accounting(), which would have been called
by btrfs_run_delayed_refs().

I receive kmemleak warnings about these thingies not being freed,
although on an older kernel. However, looking at Josef's tree, this
still seems to be the case.
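
A minimal sketch of the cleanup I would expect on the abort path
(assuming qgroup_update is linked through a 'list' member, as in the
qgroup.c of that era):

    struct qgroup_update *u, *tmp;

    list_for_each_entry_safe(u, tmp, &trans->qgroup_ref_list, list) {
            list_del(&u->list);
            kfree(u);  /* allocated in btrfs_qgroup_record_ref() */
    }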

Thanks,
Alex.


On Mon, Oct 14, 2013 at 7:59 AM, Liu Bo  wrote:
> It's unnecessary to do qgroups accounting without enabling quota.
>
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/ctree.c   |2 +-
>  fs/btrfs/delayed-ref.c |   18 ++
>  fs/btrfs/qgroup.c  |3 +++
>  3 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 61b5bcd..fb89235 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -407,7 +407,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
>
> tree_mod_log_write_lock(fs_info);
> spin_lock(&fs_info->tree_mod_seq_lock);
> -   if (!elem->seq) {
> +   if (elem && !elem->seq) {
> elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
> list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
> }
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9e1a1c9..3ec3d08 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -691,8 +691,13 @@ static noinline void add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
> ref->is_head = 0;
> ref->in_tree = 1;
>
> -   if (need_ref_seq(for_cow, ref_root))
> -   seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
> +   if (need_ref_seq(for_cow, ref_root)) {
> +   struct seq_list *elem = NULL;
> +
> +   if (fs_info->quota_enabled)
> +   elem = &trans->delayed_ref_elem;
> +   seq = btrfs_get_tree_mod_seq(fs_info, elem);
> +   }
> ref->seq = seq;
>
> full_ref = btrfs_delayed_node_to_tree_ref(ref);
> @@ -750,8 +755,13 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info,
> ref->is_head = 0;
> ref->in_tree = 1;
>
> -   if (need_ref_seq(for_cow, ref_root))
> -   seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
> +   if (need_ref_seq(for_cow, ref_root)) {
> +   struct seq_list *elem = NULL;
> +
> +   if (fs_info->quota_enabled)
> +   elem = &trans->delayed_ref_elem;
> +   seq = btrfs_get_tree_mod_seq(fs_info, elem);
> +   }
> ref->seq = seq;
>
> full_ref = btrfs_delayed_node_to_data_ref(ref);
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 4e6ef49..1cb58f9 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1188,6 +1188,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
>  {
> struct qgroup_update *u;
>
> +   if (!trans->root->fs_info->quota_enabled)
> +   return 0;
> +
> BUG_ON(!trans->delayed_ref_elem.seq);
> u = kmalloc(sizeof(*u), GFP_NOFS);
> if (!u)
> --
> 1.7.7
>


Re: [PATCH v2] btrfs-progs: calculate disk space that a subvol could free

2013-11-28 Thread Alex Lyakas
Hello Anand,
I have sent a similar comment to your email thread in
http://www.spinics.net/lists/linux-btrfs/msg27547.html

I believe this approach of calculating freeable space is incorrect.
Try this:
- create a fresh btrfs
- create a regular file
- write some data into it in such a way that it has, say, 4000
EXTENT_DATA items, so that the file tree and extent tree get deep enough
- run btrfs-debug-tree and verify that all EXTENT_ITEMs of this file
(in the extent tree) have refcnt=1
- create a snapshot of the subvolume
- run btrfs-debug-tree again

You will see that most (in my case, all) of the EXTENT_ITEMs still have
refcnt=1. The reason for this is as I mentioned in
http://www.spinics.net/lists/linux-btrfs/msg27547.html

But if you delete the subvolume, no space will be freed, because all
these extents are also shared by the snapshot. Although it seems that
your tool will report that all subvolume's space is freeable (based on
refcnt=1).

Can you pls try that experiment and comment on it? Perhaps I am
missing something here?

Thanks!
Alex.



On Thu, Oct 10, 2013 at 6:33 AM, Wang Shilong
 wrote:
> On 10/10/2013 11:35 AM, Anand Jain wrote:
>>
>>
>>  If 'btrfs_file_extent_item' can contain the ref count it would
>>  solve the current problem quite easily.  (The problem is that it
>>  takes n * n searches to know the data extents with their refs for
>>  a given subvol.)
>
> Just considering btrfs_file_extent_item is not enough, because
> we should consider metadata(as i have said before).
>
>>
>>  But what'r the challenge(s) to have ref count in the
>>  btrfs_file_extent_item ? any thoughts ?
>
> Haven't thought a better idea yet.
>
>
>>
>> Thanks, Anand


Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv

2013-11-04 Thread Alex Lyakas
Hi Filipe,
any luck with this patch? :)

Alex.

On Wed, Oct 23, 2013 at 5:26 PM, Filipe David Manana  wrote:
> On Wed, Oct 23, 2013 at 3:14 PM, Alex Lyakas
>  wrote:
>> Hello,
>>
>> On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana  
>> wrote:
>>> On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas
>>>  wrote:
>>>> Hi Filipe,
>>>>
>>>>
>>>> On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana
>>>>  wrote:
>>>>>
>>>>> This issue is simple to reproduce and observe if kmemleak is enabled.
>>>>> Two simple ways to reproduce it:
>>>>>
>>>>> ** 1
>>>>>
>>>>> $ mkfs.btrfs -f /dev/loop0
>>>>> $ mount /dev/loop0 /mnt/btrfs
>>>>> $ btrfs balance start /mnt/btrfs
>>>>> $ umount /mnt/btrfs
>>
>> So here it seems that the leak can only happen in case the block-group
>> has a free-space inode. This is what the orphan item is added for.
>> Yes, here kmemleak reports a leak.
>> But: if the space_cache option is disabled (and nospace_cache enabled), it
>> seems that btrfs still creates the FREE_SPACE inodes, although they
>> are empty because in cache_save_setup:
>>
>> inode = lookup_free_space_inode(root, block_group, path);
>> if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
>> ret = PTR_ERR(inode);
>> btrfs_release_path(path);
>> goto out;
>> }
>>
>> if (IS_ERR(inode)) {
>> ...
>> ret = create_free_space_inode(root, trans, block_group, path);
>>
>> and only later it actually sets BTRFS_DC_WRITTEN if space_cache option
>> is disabled. Amazing!
>> Although this is a different issue, do you know perhaps why these
>> empty inodes are needed?
>
> Don't know if they are needed. But you have a point, it seems odd to
> create the free space cache inode if mount option nospace_cache was
> supplied. Thanks Alex. Testing the following patch:
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c43ee8a..eb1b7da 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3162,6 +3162,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
> int retries = 0;
> int ret = 0;
>
> +   if (!btrfs_test_opt(root, SPACE_CACHE))
> +   return 0;
> +
> /*
>  * If this block group is smaller than 100 megs don't bother caching
>  * the block group.
>
>
>>
>> Thanks!
>> Alex.
>>
>>
>>
>>>>>
>>>>> ** 2
>>>>>
>>>>> $ mkfs.btrfs -f /dev/loop0
>>>>> $ mount /dev/loop0 /mnt/btrfs
>>>>> $ touch /mnt/btrfs/foobar
>>>>> $ rm -f /mnt/btrfs/foobar
>>>>> $ umount /mnt/btrfs
>>>>
>>>>
>>>> I tried the second repro script on kernel 3.8.13, and kmemleak does
>>>> not report a leak (even if I force the kmemleak scan). I did not try
>>>> the balance-repro script, though. Am I missing something?
>>>
>>> Maybe it's not an issue on 3.8.13 and older releases.
>>> This was on btrfs-next from August 19.
>>>
>>> thanks for testing
>>>
>>>>
>>>> Thanks,
>>>> Alex.
>>>>
>>>>
>>>>>
>>>>>
>>>>> After a while, kmemleak reports the leak:
>>>>>
>>>>> $ cat /sys/kernel/debug/kmemleak
>>>>> unreferenced object 0x880402b13e00 (size 128):
>>>>>   comm "btrfs", pid 19621, jiffies 4341648183 (age 70057.844s)
>>>>>   hex dump (first 32 bytes):
>>>>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>>>>> 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de  .N..
>>>>>   backtrace:
>>>>> [] kmemleak_alloc+0x26/0x50
>>>>> [] kmem_cache_alloc_trace+0xeb/0x1d0
>>>>> [] btrfs_alloc_block_rsv+0x39/0x70 [btrfs]
>>>>> [] btrfs_orphan_add+0x13d/0x1b0 [btrfs]
>>>>> [] btrfs_remove_block_group+0x143/0x500 [btrfs]
>>>>> [] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs]
>>>>> [] btrfs_balance+0x8f7/0xe90 [btrfs]
>>>>> [] btrfs_ioctl_balance+0x250/0x550 [btrfs]
>>>>> [] btrfs_ioctl+0xdfa/0x25f0 [btrfs]
>>>>> [] do_vfs_ioctl+0x96/0x57

Re: [PATCH] btrfs-progs: calculate disk space that a subvol could free upon delete

2013-10-26 Thread Alex Lyakas
Thinking about this more, I believe this way of checking for exclusive
data doesn't work. When a snapshot is created, btrfs doesn't go and
explicitly increment refcount on *all* relevant EXTENT_ITEMs in the
extent tree. This way creating a snapshot would take forever for large
subvolumes. Instead, it only does that on EXTENT_ITEMs, which are
pointed by EXTENT_DATAs in the root node of the snapshottted file
tree. For rest of nodes/leafs of a file tree, an "implicit" tree-block
references are added (not sure if "implicit" is the right term) for
top tree blocks only. This is accomplished by _btrfs_mod_ref() code,
called from btrfs_copy_root() during snapshot creation flow. Snapshot
deletion code is the one that is smart enough to properly "unshare"
shared tree blocks with such "implicit" references.

What do you think?
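
The flow I am referring to, heavily simplified (argument lists from
memory, so treat them as approximate):

    /* snapshot creation, btrfs_copy_root(): */
    cow = btrfs_alloc_free_block(...);    /* new root block for the snap */
    copy_extent_buffer(cow, buf, ...);    /* all children now shared */
    btrfs_inc_ref(trans, root, cow, ...); /* _btrfs_mod_ref(): adds refs
                                           * only for the blocks the new
                                           * root points to directly --
                                           * nothing deeper is touched */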

Alex.


On Sat, Oct 26, 2013 at 10:49 PM, Alex Lyakas
 wrote:
> Hi Anand,
>
> 1) so let's say I have a subvolume and a snapshot of this subvolume.
> So in this case, I will see "Sole space = 0" for both of them,
> correct? Because all extents (except inline ones) are shared.
>
> 2) How is this in terms of responsiveness? On a huge subvolume, we
> need to iterate all the EXTENT_DATAs and then lookup their
> EXTENT_ITEMs.
>
> 3) So it's kind of poor man's replacement for quota groups, correct?
>
> I like that it's so easy to check for exclusive data, though:)
>
> Alex.
>
>
> On Fri, Sep 13, 2013 at 6:44 PM, Wang Shilong  
> wrote:
>>
>> Hello Anand,
>>
>>> (This patch is for review and comments only)
>>>
>>> This patch provides a way to know how much space can be
>>> relinquished if when subvol /snapshot is deleted.  With
>>> this sys admin can make better judgments in managing the
>>> filesystem when fs is near full.
>>>
>>
>> I think this is really *helpful* since users can not really know how much
>> space(Exclusive) in a subvolume .
>>
>> Thanks,
>> Wang
>>
>>> as shown below the parameter 'sole space' indicates the size
>>> which is freed when subvol is deleted. (any other better
>>> term for this?, pls suggest).
>>> -
>>> btrfs su show /btrfs/sv1
>>> /btrfs/sv1
>>>   Name:   sv1
>>>   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
>>>   Parent uuid:-
>>>   Creation time:  2013-09-13 18:17:32
>>>   Object ID:  257
>>>   Generation (Gen):   18
>>>   Gen at creation:17
>>>   Parent: 5
>>>   Top Level:  5
>>>   Flags:  -
>>>   Sole space: 1.56MiB <
>>>   Snapshot(s):
>>>
>>> btrfs su snap /btrfs/sv1 /btrfs/ss2
>>> Create a snapshot of '/btrfs/sv1' in '/btrfs/ss2'
>>>
>>> btrfs su show /btrfs/sv1
>>> /btrfs/sv1
>>>   Name:   sv1
>>>   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
>>>   Parent uuid:-
>>>   Creation time:  2013-09-13 18:17:32
>>>   Object ID:  257
>>>   Generation (Gen):   19
>>>   Gen at creation:17
>>>   Parent: 5
>>>   Top Level:  5
>>>   Flags:  -
>>>   Sole space: 0.00  <-
>>>   Snapshot(s):
>>>   ss2
>>> -
>>>
>>> Signed-off-by: Anand Jain 
>>> ---
>>> cmds-subvolume.c |   5 ++
>>> utils.c  | 154 +++
>>> utils.h  |   1 +
>>> 3 files changed, 160 insertions(+)
>>>
>>> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
>>> index de246ab..2b02d66 100644
>>> --- a/cmds-subvolume.c
>>> +++ b/cmds-subvolume.c
>>> @@ -809,6 +809,7 @@ static int cmd_subvol_show(int argc, char **argv)
>>>   int fd = -1, mntfd = -1;
>>>   int ret = 1;
>>>   DIR *dirstream1 = NULL, *dirstream2 = NULL;
>>> + u64 freeable_bytes;
>>>
>>>   if (check_argc_exact(argc, 2))
>>>   usage(cmd_subvol_show_usage);
>>> @@ -878,6 +879,8 @@ static int cmd_subvol_show(int argc, char **argv)
>>>   goto out;
>>>   }
>>>
>>

Re: [PATCH] btrfs-progs: calculate disk space that a subvol could free upon delete

2013-10-26 Thread Alex Lyakas
Hi Anand,

1) so let's say I have a subvolume and a snapshot of this subvolume.
So in this case, I will see "Sole space = 0" for both of them,
correct? Because all extents (except inline ones) are shared.

2) How is this in terms of responsiveness? On a huge subvolume, we
need to iterate all the EXTENT_DATAs and then lookup their
EXTENT_ITEMs.

3) So it's kind of poor man's replacement for quota groups, correct?

I like that it's so easy to check for exclusive data, though:)

Alex.


On Fri, Sep 13, 2013 at 6:44 PM, Wang Shilong  wrote:
>
> Hello Anand,
>
>> (This patch is for review and comments only)
>>
>> This patch provides a way to know how much space can be
>> relinquished if when subvol /snapshot is deleted.  With
>> this sys admin can make better judgments in managing the
>> filesystem when fs is near full.
>>
>
> I think this is really *helpful* since users can not really know how much
> space(Exclusive) in a subvolume .
>
> Thanks,
> Wang
>
>> as shown below the parameter 'sole space' indicates the size
>> which is freed when subvol is deleted. (any other better
>> term for this?, pls suggest).
>> -
>> btrfs su show /btrfs/sv1
>> /btrfs/sv1
>>   Name:   sv1
>>   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
>>   Parent uuid:-
>>   Creation time:  2013-09-13 18:17:32
>>   Object ID:  257
>>   Generation (Gen):   18
>>   Gen at creation:17
>>   Parent: 5
>>   Top Level:  5
>>   Flags:  -
>>   Sole space: 1.56MiB <
>>   Snapshot(s):
>>
>> btrfs su snap /btrfs/sv1 /btrfs/ss2
>> Create a snapshot of '/btrfs/sv1' in '/btrfs/ss2'
>>
>> btrfs su show /btrfs/sv1
>> /btrfs/sv1
>>   Name:   sv1
>>   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
>>   Parent uuid:-
>>   Creation time:  2013-09-13 18:17:32
>>   Object ID:  257
>>   Generation (Gen):   19
>>   Gen at creation:17
>>   Parent: 5
>>   Top Level:  5
>>   Flags:  -
>>   Sole space: 0.00  <-
>>   Snapshot(s):
>>   ss2
>> -
>>
>> Signed-off-by: Anand Jain 
>> ---
>> cmds-subvolume.c |   5 ++
>> utils.c  | 154 +++
>> utils.h  |   1 +
>> 3 files changed, 160 insertions(+)
>>
>> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
>> index de246ab..2b02d66 100644
>> --- a/cmds-subvolume.c
>> +++ b/cmds-subvolume.c
>> @@ -809,6 +809,7 @@ static int cmd_subvol_show(int argc, char **argv)
>>   int fd = -1, mntfd = -1;
>>   int ret = 1;
>>   DIR *dirstream1 = NULL, *dirstream2 = NULL;
>> + u64 freeable_bytes;
>>
>>   if (check_argc_exact(argc, 2))
>>   usage(cmd_subvol_show_usage);
>> @@ -878,6 +879,8 @@ static int cmd_subvol_show(int argc, char **argv)
>>   goto out;
>>   }
>>
>> + freeable_bytes = get_subvol_freeable_bytes(fd);
>> +
>>   ret = 0;
>>   /* print the info */
>>   printf("%s\n", fullpath);
>> @@ -915,6 +918,8 @@ static int cmd_subvol_show(int argc, char **argv)
>>   else
>>   printf("\tFlags: \t\t\t-\n");
>>
>> + printf("\tSole space: \t\t%s\n",
>> + pretty_size(freeable_bytes));
>>   /* print the snapshots of the given subvol if any*/
>>   printf("\tSnapshot(s):\n");
>>   filter_set = btrfs_list_alloc_filter_set();
>> diff --git a/utils.c b/utils.c
>> index 22c3310..f01d580 100644
>> --- a/utils.c
>> +++ b/utils.c
>> @@ -2019,3 +2019,157 @@ int is_dev_excl_op_free(int fd)
>>   ret = ioctl(fd, BTRFS_IOC_CHECK_DEV_EXCL_OPS, NULL);
>>   return ret > 0 ? ret : -errno;
>> }
>> +
>> +/* gets the ref count for given extent
>> + * 0 = didn't find the item
>> + * n = number of references
>> +*/
>> +u64 get_extent_refcnt(int fd, u64 disk_blk)
>> +{
>> + int ret = 0, i, e;
>> + struct btrfs_ioctl_search_args args;
>> + struct btrfs_ioctl_search_key *sk = &args.key;
>> + struct btrfs_ioctl_search_header sh;
>> + unsigned long off = 0;
>> +
>> + memset(&args, 0, sizeof(args));
>> +
>> + sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
>> +
>> + sk->min_type = BTRFS_EXTENT_ITEM_KEY;
>> + sk->max_type = BTRFS_EXTENT_ITEM_KEY;
>> +
>> + sk->min_objectid = disk_blk;
>> + sk->max_objectid = disk_blk;
>> +
>> + sk->max_offset = (u64)-1;
>> + sk->max_transid = (u64)-1;
>> +
>> + while (1) {
>> + sk->nr_items = 4096;
>> +
>> + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
>> + e = errno;
>> + if (ret < 0) {
>> + fprintf(stderr, "ERROR: search failed - %s\n",
>> + strerror(e));
>> + return 0;
>>

Re: [patch 3/7] btrfs: Add per-super attributes to sysfs

2013-10-26 Thread Alex Lyakas
Hi Jeff,

On Tue, Sep 10, 2013 at 7:24 AM, Jeff Mahoney  wrote:
> This patch adds per-super attributes to sysfs.
>
> It doesn't publish any attributes yet, but does the proper lifetime
> handling as well as the basic infrastructure to add new attributes.
>
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/ctree.h |2 +
>  fs/btrfs/super.c |   13 +++-
>  fs/btrfs/sysfs.c |   58 +++
>  fs/btrfs/sysfs.h |   19 ++
>  4 files changed, 91 insertions(+), 1 deletion(-)
>
> --- a/fs/btrfs/ctree.h  2013-09-10 00:09:12.990087784 -0400
> +++ b/fs/btrfs/ctree.h  2013-09-10 00:09:35.521794520 -0400
> @@ -3694,6 +3694,8 @@ int btrfs_defrag_leaves(struct btrfs_tra
>  /* sysfs.c */
>  int btrfs_init_sysfs(void);
>  void btrfs_exit_sysfs(void);
> +int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info);
> +void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info);
>
>  /* xattr.c */
>  ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
> --- a/fs/btrfs/super.c  2013-09-10 00:09:12.994087730 -0400
> +++ b/fs/btrfs/super.c  2013-09-10 00:09:35.525794464 -0400
> @@ -301,6 +301,8 @@ void __btrfs_panic(struct btrfs_fs_info
>
>  static void btrfs_put_super(struct super_block *sb)
>  {
> +   btrfs_sysfs_remove_one(btrfs_sb(sb));
> +
> (void)close_ctree(btrfs_sb(sb)->tree_root);
> /* FIXME: need to fix VFS to return error? */
> /* AV: return it _where_?  ->put_super() can be triggered by any number
> @@ -1143,8 +1145,17 @@ static struct dentry *btrfs_mount(struct
> }
>
> root = !error ? get_default_root(s, subvol_objectid) : ERR_PTR(error);
> -   if (IS_ERR(root))
> +   if (IS_ERR(root)) {
> deactivate_locked_super(s);
> +   return root;
> +   }
> +
> +   error = btrfs_sysfs_add_one(fs_info);
> +   if (error) {
> +   dput(root);
> +   deactivate_locked_super(s);
> +   return ERR_PTR(error);
> +   }
>
> return root;
>
> --- a/fs/btrfs/sysfs.c  2013-09-10 00:09:13.002087628 -0400
> +++ b/fs/btrfs/sysfs.c  2013-09-10 00:09:49.501616538 -0400
> @@ -61,6 +61,64 @@ static struct attribute *btrfs_supp_feat
> NULL
>  };
>
> +static struct attribute *btrfs_attrs[] = {
> +   NULL,
> +};
> +
> +static void btrfs_fs_info_release(struct kobject *kobj)
> +{
> +   struct btrfs_fs_info *fs_info;
> +   fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj);
> +   complete(&fs_info->kobj_unregister);
> +}
> +
> +static ssize_t btrfs_attr_show(struct kobject *kobj,
> +  struct attribute *attr, char *buf)
> +{
> +   struct btrfs_attr *a = container_of(attr, struct btrfs_attr, attr);
> +   struct btrfs_fs_info *fs_info;
> +   fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj);
> +
> +   return a->show ? a->show(a, fs_info, buf) : 0;
> +}
> +
> +static ssize_t btrfs_attr_store(struct kobject *kobj,
> +   struct attribute *attr,
> +   const char *buf, size_t len)
> +{
> +   struct btrfs_attr *a = container_of(attr, struct btrfs_attr, attr);
> +   struct btrfs_fs_info *fs_info;
> +   fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj);
> +
> +   return a->store ? a->store(a, fs_info, buf, len) : 0;
> +}
> +
> +static const struct sysfs_ops btrfs_attr_ops = {
> +   .show = btrfs_attr_show,
> +   .store = btrfs_attr_store,
> +};
> +
> +static struct kobj_type btrfs_ktype = {
> +   .default_attrs  = btrfs_attrs,
> +   .sysfs_ops  = &btrfs_attr_ops,
> +   .release= btrfs_fs_info_release,
> +};
> +
> +int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info)
> +{
> +   init_completion(&fs_info->kobj_unregister);
> +   fs_info->super_kobj.kset = btrfs_kset;
> +   return kobject_init_and_add(&fs_info->super_kobj, &btrfs_ktype, NULL,
> +   "%pU", fs_info->fsid);
> +}
> +
> +void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info)
> +{
> +   kobject_del(&fs_info->super_kobj);
Is there a reason for this explicit call? The last kobject_put will do
this automatically, no?

> +   kobject_put(&fs_info->super_kobj);
> +   wait_for_completion(&fs_info->kobj_unregister);
> +}
> +
>  static void btrfs_supp_feat_release(struct kobject *kobj)
>  {
> complete(&btrfs_feat->f_kobj_unregister);
> --- a/fs/btrfs/sysfs.h  2013-09-10 00:09:13.002087628 -0400
> +++ b/fs/btrfs/sysfs.h  2013-09-10 00:09:35.525794464 -0400
> @@ -8,6 +8,24 @@ enum btrfs_feature_set {
> FEAT_MAX
>  };
>
> +struct btrfs_attr {
> +   struct attribute attr;
> +   ssize_t (*show)(struct btrfs_attr *, struct btrfs_fs_info *, char *);
> +   ssize_t (*store)(struct btrfs_attr *, struct btrfs_fs_info *,
> +const char *, size_t);
> +};
> +
> +#define __INIT_BTRFS_AT

Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv

2013-10-23 Thread Alex Lyakas
Hello,

On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana  wrote:
> On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas
>  wrote:
>> Hi Filipe,
>>
>>
>> On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana
>>  wrote:
>>>
>>> This issue is simple to reproduce and observe if kmemleak is enabled.
>>> Two simple ways to reproduce it:
>>>
>>> ** 1
>>>
>>> $ mkfs.btrfs -f /dev/loop0
>>> $ mount /dev/loop0 /mnt/btrfs
>>> $ btrfs balance start /mnt/btrfs
>>> $ umount /mnt/btrfs

So here it seems that the leak can only happen in case the block-group
has a free-space inode. This is what the orphan item is added for.
Yes, in this case kmemleak reports the leak.
But: if the space_cache option is disabled (and nospace_cache enabled), it
seems that btrfs still creates the FREE_SPACE inodes, although they
are empty, because in cache_save_setup:

inode = lookup_free_space_inode(root, block_group, path);
if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
ret = PTR_ERR(inode);
btrfs_release_path(path);
goto out;
}

if (IS_ERR(inode)) {
...
ret = create_free_space_inode(root, trans, block_group, path);

and only later does it actually set BTRFS_DC_WRITTEN if the space_cache
option is disabled. Amazing!
Although this is a different issue, do you know perhaps why these
empty inodes are needed?

Thanks!
Alex.



>>>
>>> ** 2
>>>
>>> $ mkfs.btrfs -f /dev/loop0
>>> $ mount /dev/loop0 /mnt/btrfs
>>> $ touch /mnt/btrfs/foobar
>>> $ rm -f /mnt/btrfs/foobar
>>> $ umount /mnt/btrfs
>>
>>
>> I tried the second repro script on kernel 3.8.13, and kmemleak does
>> not report a leak (even if I force the kmemleak scan). I did not try
>> the balance-repro script, though. Am I missing something?
>
> Maybe it's not an issue on 3.8.13 and older releases.
> This was on btrfs-next from August 19.
>
> thanks for testing
>
>>
>> Thanks,
>> Alex.
>>
>>
>>>
>>>
>>> After a while, kmemleak reports the leak:
>>>
>>> $ cat /sys/kernel/debug/kmemleak
>>> unreferenced object 0x880402b13e00 (size 128):
>>>   comm "btrfs", pid 19621, jiffies 4341648183 (age 70057.844s)
>>>   hex dump (first 32 bytes):
>>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>>> 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de  .N..
>>>   backtrace:
>>> [] kmemleak_alloc+0x26/0x50
>>> [] kmem_cache_alloc_trace+0xeb/0x1d0
>>> [] btrfs_alloc_block_rsv+0x39/0x70 [btrfs]
>>> [] btrfs_orphan_add+0x13d/0x1b0 [btrfs]
>>> [] btrfs_remove_block_group+0x143/0x500 [btrfs]
>>> [] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs]
>>> [] btrfs_balance+0x8f7/0xe90 [btrfs]
>>> [] btrfs_ioctl_balance+0x250/0x550 [btrfs]
>>> [] btrfs_ioctl+0xdfa/0x25f0 [btrfs]
>>> [] do_vfs_ioctl+0x96/0x570
>>> [] SyS_ioctl+0x91/0xb0
>>> [] system_call_fastpath+0x16/0x1b
>>> [] 0x
>>>
>>> This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb
>>> (Btrfs: separate out tests into their own directory).
>>>
>>> Signed-off-by: Filipe David Borba Manana 
>>> ---
>>>
>>> V2: removed atomic_t member in struct btrfs_block_rsv, as suggested by
>>> Josef Bacik, and use instead the condition reserved == 0 to decide
>>> when to free the block.
>>> V3: simplified patch, just kfree() (and not btrfs_free_block_rsv) the
>>> root's orphan_block_rsv when free'ing the root. Thanks Josef for
>>> the suggestion.
>>> V4: use btrfs_free_block_rsv() instead of kfree(). The error I was getting
>>> in xfstests when using btrfs_free_block_rsv() was unrelated, Josef just
>>> pointed it to me (separate issue).
>>> V5: move the free call below the iput() call, so that btrfs_evict_node()
>>> can process the orphan_block_rsv first to do some needed cleanup before
>>> we free it.
>>> V6: free the root's orphan_block_rsv in close_ctree() too. After a balance
>>> the orphan_block_rsv of the tree of tree roots was being leaked, because
>>> free_fs_root() is only called for filesystem trees.
>>>
>>>  fs/btrfs/disk-io.c |5 +
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 3b12c26..5d17163 100644
>>> --- a/fs/btrfs/

Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv

2013-10-23 Thread Alex Lyakas
Hi Filipe,


On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana
 wrote:
>
> This issue is simple to reproduce and observe if kmemleak is enabled.
> Two simple ways to reproduce it:
>
> ** 1
>
> $ mkfs.btrfs -f /dev/loop0
> $ mount /dev/loop0 /mnt/btrfs
> $ btrfs balance start /mnt/btrfs
> $ umount /mnt/btrfs
>
> ** 2
>
> $ mkfs.btrfs -f /dev/loop0
> $ mount /dev/loop0 /mnt/btrfs
> $ touch /mnt/btrfs/foobar
> $ rm -f /mnt/btrfs/foobar
> $ umount /mnt/btrfs


I tried the second repro script on kernel 3.8.13, and kmemleak does
not report a leak (even if I force the kmemleak scan). I did not try
the balance-repro script, though. Am I missing something?

Thanks,
Alex.


>
>
> After a while, kmemleak reports the leak:
>
> $ cat /sys/kernel/debug/kmemleak
> unreferenced object 0x880402b13e00 (size 128):
>   comm "btrfs", pid 19621, jiffies 4341648183 (age 70057.844s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de  .N..
>   backtrace:
> [] kmemleak_alloc+0x26/0x50
> [] kmem_cache_alloc_trace+0xeb/0x1d0
> [] btrfs_alloc_block_rsv+0x39/0x70 [btrfs]
> [] btrfs_orphan_add+0x13d/0x1b0 [btrfs]
> [] btrfs_remove_block_group+0x143/0x500 [btrfs]
> [] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs]
> [] btrfs_balance+0x8f7/0xe90 [btrfs]
> [] btrfs_ioctl_balance+0x250/0x550 [btrfs]
> [] btrfs_ioctl+0xdfa/0x25f0 [btrfs]
> [] do_vfs_ioctl+0x96/0x570
> [] SyS_ioctl+0x91/0xb0
> [] system_call_fastpath+0x16/0x1b
> [] 0x
>
> This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb
> (Btrfs: separate out tests into their own directory).
>
> Signed-off-by: Filipe David Borba Manana 
> ---
>
> V2: removed atomic_t member in struct btrfs_block_rsv, as suggested by
> Josef Bacik, and use instead the condition reserved == 0 to decide
> when to free the block.
> V3: simplified patch, just kfree() (and not btrfs_free_block_rsv) the
> root's orphan_block_rsv when free'ing the root. Thanks Josef for
> the suggestion.
> V4: use btrfs_free_block_rsv() instead of kfree(). The error I was getting
> in xfstests when using btrfs_free_block_rsv() was unrelated, Josef just
> pointed it to me (separate issue).
> V5: move the free call below the iput() call, so that btrfs_evict_node()
> can process the orphan_block_rsv first to do some needed cleanup before
> we free it.
> V6: free the root's orphan_block_rsv in close_ctree() too. After a balance
> the orphan_block_rsv of the tree of tree roots was being leaked, because
> free_fs_root() is only called for filesystem trees.
>
>  fs/btrfs/disk-io.c |5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 3b12c26..5d17163 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3430,6 +3430,8 @@ static void free_fs_root(struct btrfs_root *root)
>  {
> iput(root->cache_inode);
> WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
> +   btrfs_free_block_rsv(root, root->orphan_block_rsv);
> +   root->orphan_block_rsv = NULL;
> if (root->anon_dev)
> free_anon_bdev(root->anon_dev);
> free_extent_buffer(root->node);
> @@ -3582,6 +3584,9 @@ int close_ctree(struct btrfs_root *root)
>
> btrfs_free_stripe_hash_table(fs_info);
>
> +   btrfs_free_block_rsv(root, root->orphan_block_rsv);
> +   root->orphan_block_rsv = NULL;
> +
> return 0;
>  }
>
> --
> 1.7.9.5
>


Re: [PATCH] btrfs: commit transaction after deleting a subvolume

2013-10-20 Thread Alex Lyakas
 Thank you for addressing this, David.

On Sat, Aug 31, 2013 at 1:25 AM, David Sterba  wrote:
> Alex pointed out the consequences after a transaction is not committed
> when a subvolume is deleted, so in case of a crash before an actual
> commit happens will let the subvolume reappear.
>
> Original post:
> http://www.spinics.net/lists/linux-btrfs/msg22088.html
>
> Josef's objections:
> http://www.spinics.net/lists/linux-btrfs/msg22256.html
>
> While there's no need to do a full commit for regular files, a subvolume
> may get a different treatment.
>
> http://www.spinics.net/lists/linux-btrfs/msg23087.html:
>
> "That a subvol/snapshot may appear after crash if transation commit did
> not happen does not feel so good. We know that the subvol is only
> scheduled for deletion and needs to be processed by cleaner.
>
> From that point I'd rather see the commit to happen to avoid any
> unexpected surprises.  A subvolume that re-appears still holds the data
> references and consumes space although the user does not assume that.
>
> Automated snapshotting and deleting needs some guarantees about the
> behaviour and what to do after a crash. So now it has to process the
> backlog of previously deleted snapshots and verify that they're not
> there, compared to "deleted -> will never appear, can forget about it".
> "
>
> There is a performance penalty incured by the change, but deleting a
> subvolume is not a frequent operation and the tradeoff seems justified
> by getting the guarantee stated above.
>
> CC: Alex Lyakas 
> CC: Josef Bacik 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ioctl.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index e407f75..4394632 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2268,7 +2268,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
> file *file,
>  out_end_trans:
> trans->block_rsv = NULL;
> trans->bytes_reserved = 0;
> -   ret = btrfs_end_transaction(trans, root);
> +   ret = btrfs_commit_transaction(trans, root);
> if (ret && !err)
> err = ret;
> inode->i_flags |= S_DEAD;
> --
> 1.7.9
>


Re: [PATCH 2/2] Btrfs: stop caching thread if extent_commit_sem is contended

2013-10-17 Thread Alex Lyakas
Thanks for addressing this issue, Josef!

On Thu, Sep 26, 2013 at 4:26 PM, Josef Bacik  wrote:
> We can starve out the transaction commit with a bunch of caching threads all
> running at the same time.  This is because we will only drop the
> extent_commit_sem if we need_resched(), which isn't likely to happen since we
> will be reading a lot from the disk so have already schedule()'ed plenty.  Alex
> observed that he could starve out a transaction commit for up to a minute with
> 32 caching threads all running at once.  This will allow us to drop the
> extent_commit_sem to allow the transaction commit to swap the commit_root out
> and then all the cachers will start back up. Here is an explanation provided
> by Ingo:
>
> So, just to fill in what happens in this loop:
>
> mutex_unlock(&caching_ctl->mutex);
> cond_resched();
> goto again;
>
> where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
> again:
>
> again:
> mutex_lock(&caching_ctl->mutex);
> /* need to make sure the commit_root doesn't disappear */
> down_read(&fs_info->extent_commit_sem);
>
> So, if I'm reading the code correct, there can be a fair amount of
> concurrency here: there may be multiple 'caching kthreads' per filesystem
> active, while there's one fs_info->extent_commit_sem per filesystem
> AFAICS.
>
> So, what happens if there are a lot of CPUs all busy holding the
> ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
> rush to try to release the fs_info->extent_commit_sem, and they'd block in
> the down_read() because there's a writer waiting.
>
> So there's a guarantee of forward progress. This should answer akpm's
> concern I think.
>
> Thanks,
>
> Acked-by: Ingo Molnar 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cfb3cf7..cc074c34 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -442,7 +442,8 @@ next:
> if (ret)
> break;
>
> -   if (need_resched()) {
> +   if (need_resched() ||
> +   rwsem_is_contended(&fs_info->extent_commit_sem)) {
> caching_ctl->progress = last;
> btrfs_release_path(path);
> up_read(&fs_info->extent_commit_sem);
> --
> 1.8.3.1
>


Re: btrfs:async-thread: atomic_start_pending=1 is set, but it's too late

2013-08-29 Thread Alex Lyakas
Thanks, Chris, Josef, for confirming!

On Thu, Aug 29, 2013 at 11:08 PM, Chris Mason  wrote:
> Quoting Josef Bacik (2013-08-29 16:03:06)
>> On Mon, Aug 26, 2013 at 05:16:42PM +0300, Alex Lyakas wrote:
>> > Greetings all,
>> > I see a following issue with spawning new threads for btrfs_workers
>> > that have atomic_worker_start set:
>> >
>> > # I have BTRFS that has 24Gb of total metadata, out of which extent
>> > tree takes 11Gb; space_cache option is not used.
>> > # After mounting, cache_block_group() triggers ~250 work items to
>> > cache-in the needed block groups.
>> > # At this point, fs_info->caching_workers has one thread, which is
>> > considered "idle".
>> > # Work items start to add to this thread's "pending" list, until this
>> > thread becomes considered "busy".
>> > # Now workers->atomic_worker_start is set, but
>> > check_pending_worker_creates() has not run yet (it is called only from
>> > worker_loop), so the same single thread is picked as "fallback".
>> >
>> > The problem is that this thread is still running the "caching_thread"
>> > function, scanning for EXTENT_ITEMs of the first block-group. This
>> > takes 3-4seconds for 1Gb block group.
>> >
>> > # Once caching_thread() exits, check_pending_worker_creates() is
>> > called, and creates the second thread, but it's too late, because all
>> > the 250 work items are already sitting in the first thread's "pending"
>> > list. So the  second thread doesn't help at all.
>> >
>> > As a result, all block-group caching is performed by the same thread,
>> > which, due to one-by-one scanning of EXTENT_ITEMs, takes forever for
>> > this BTRFS.
>> >
>> > How this can be fixed?
>> > - can perhaps check_pending_worker_creates() be called out of
>> > worker_loop, e.g., by find_worker()? Instead of just setting
>> > workers->atomic_start_pending?
>> > - maybe for fs_info->caching_workers we don't have to create new
>> > workers asynchronously, so we can pass NULL for async_helper in
>> > btrfs_init_workers()? (probably we have to, just checking)
>>
>> So I looked at this, and I'm pretty sure we have an async_helper just 
>> because of
>> copy+paste.  "Hey I want a new async group, let me copy this other one and
>> change the name!"  So yes let's just pass NULL here.  In fact the only cases
>> that we should be using an async helper is for super critical areas, so I'm
>> pretty sure _most_ of the cases that specify an async helper don't need to.
>> Chris is this correct, or am I missing something?  Thanks,
>
> No, I think we can just turn off the async start here.
>
> -chris
>


Re: [PATCH] Notify caching_thread()s to give up on extent_commit_sem when needed.

2013-08-29 Thread Alex Lyakas
On Thu, Aug 29, 2013 at 10:55 PM, Josef Bacik  wrote:
> On Thu, Aug 29, 2013 at 10:09:29PM +0300, Alex Lyakas wrote:
>> Hi Josef,
>>
>> On Thu, Aug 29, 2013 at 5:38 PM, Josef Bacik  wrote:
>> > On Thu, Aug 29, 2013 at 01:31:05PM +0300, Alex Lyakas wrote:
>> >> caching_thread()s do all their work under read access to 
>> >> extent_commit_sem.
>> >> They give up on this read access only when need_resched() tells them, or
>> >> when they exit. As a result, somebody that wants a WRITE access to this 
>> >> sem,
>> >> might wait for a long time. Especially this is problematic in
>> >> cache_block_group(),
>> >> which can be called on critical paths like find_free_extent() and in 
>> >> commit
>> >> path via commit_cowonly_roots().
>> >>
>> >> This patch is an RFC, that attempts to fix this problem, by notifying the
>> >> caching threads to give up on extent_commit_sem.
>> >>
>> >> On a system with a lot of metadata (~20Gb total metadata, ~10Gb extent 
>> >> tree),
>> >> with increased number of caching_threads, commits were very slow,
>> >> stuck in commit_cowonly_roots, due to this issue.
>> >> With this patch, commits no longer get stuck in commit_cowonly_roots.
>> >>
>> >
>> > But what kind of effect do you see on overall performance/runtime?  
>> > Honestly I'd
>> > expect we'd spend more of our time waiting for the caching kthread to fill 
>> > in
>> > free space so we can make allocations than waiting on this lock 
>> > contention.  I'd
>> > like to see real numbers here to see what kind of effect this patch has on 
>> > your
>> > workload.  (I don't doubt it makes a difference, I'm just curious to see 
>> > how big
>> > of a difference it makes.)
>>
>> Primarily for me it affects the commit thread right after mounting,
>> when it spends time in the "critical part" of the commit, in which
>> trans_no_join is set, i.e., it is not possible to start a new
>> transaction. So all the new writers that want a transaction are
>> delayed at this point.
>>
>> Here are some numbers (and some more logs are in the attached file).
>>
>> Filesystem has a good amount of metadata (btrfs-progs modified
>> slightly to print exact byte values):
>> root@dc:/home/zadara# btrfs fi df /btrfs/pool-0002/
>> Data: total=846116945920(788.01GB), used=842106667008(784.27GB)
>> System: total=4194304(4.00MB), used=94208(92.00KB)
>> Metadata: total=31146901504(29.01GB), used=25248698368(23.51GB)
>>
>> original code, 2 caching workers, try 1
>> Aug 29 13:41:22 dc kernel: [28381.203745] [17617][tx]btrfs
>> [ZBTRFS_TXN_COMMIT_PHASE_STARTED:439] FS[dm-119] txn[6627] COMMIT
>> extwr:0 wr:1
>> Aug 29 13:41:25 dc kernel: [28384.624838] [17617][tx]btrfs
>> [ZBTRFS_TXN_COMMIT_PHASE_DONE:519] FS[dm-119] txn[6627] COMMIT took
>> 3421 ms committers=1 open=0ms blocked=3188ms
>> Aug 29 13:41:25 dc kernel: [28384.624846] [17617][tx]btrfs
>> [ZBTRFS_TXN_COMMIT_PHASE_DONE:524] FS[dm-119] txn[6627] roo:0 rdr1:0
>> cbg:0 rdr2:0
>> Aug 29 13:41:25 dc kernel: [28384.624850] [17617][tx]btrfs
>> [ZBTRFS_TXN_COMMIT_PHASE_DONE:529] FS[dm-119] txn[6627] wc:0 wpc:0
>> wew:0 fps:0
>> Aug 29 13:41:25 dc kernel: [28384.624854] [17617][tx]btrfs
>> [ZBTRFS_TXN_COMMIT_PHASE_DONE:534] -FS[dm-119] txn[6627] ww:0 cs:0
>> rdi:0 rdr3:0
>> Aug 29 13:41:25 dc kernel: [28384.624858] [17617][tx]btrfs
>> [ZBTRFS_TXN_COMMIT_PHASE_DONE:538] -FS[dm-119] txn[6627] cfr:0
>> ccr:2088 pec:1099
>> Aug 29 13:41:25 dc kernel: [28384.624862] [17617][tx]btrfs
>> [ZBTRFS_TXN_COMMIT_PHASE_DONE:541] FS[dm-119] txn[6627] wrw:230 wrs:1
>>
>> I have a breakdown of commit times here, to identify bottlenecks of
>> the commit. Times are in ms.
>> Names of phases are:
>>
>> roo - btrfs_run_ordered_operations
>> rdr1 - btrfs_run_delayed_refs (call 1)
>> cbg - btrfs_create_pending_block_groups
>> rdr2 - btrfs_run_delayed_refs (call 2)
>> wc - wait_for_commit (if was needed)
>> wpc - wair for previous commit (if was needed)
>> wew - wait for "external writers to detach"
>> fps - flush_all_pending_stuffs
>> ww - wait for all the other writers to detach
>> cs - create_pending_snapshots
>> rdi - btrfs_run_delayed_items
>> rdr3 - btrfs_run_delayed_refs (call 3)
>> cfr - commit_fs_roots
>> ccr - commit_cowonly_roots
>> pec - btrfs_prepare_extent_commit

[PATCH] Notify caching_thread()s to give up on extent_commit_sem when needed.

2013-08-29 Thread Alex Lyakas
caching_thread()s do all their work under read access to extent_commit_sem.
They give up on this read access only when need_resched() tells them, or
when they exit. As a result, somebody that wants WRITE access to this sem
might wait for a long time. This is especially problematic in
cache_block_group(), which can be called on critical paths like
find_free_extent() and in the commit path via commit_cowonly_roots().

This patch is an RFC, that attempts to fix this problem, by notifying the
caching threads to give up on extent_commit_sem.

On a system with a lot of metadata (~20Gb total metadata, ~10Gb extent tree),
with increased number of caching_threads, commits were very slow,
stuck in commit_cowonly_roots, due to this issue.
With this patch, commits no longer get stuck in commit_cowonly_roots.

This patch is not intended to be applied; it is a request for comments on whether
you agree this problem happens, and whether the fix goes in the right direction.

Signed-off-by: Alex Lyakas 
---
 fs/btrfs/ctree.h   |7 +++
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/extent-tree.c |9 +
 fs/btrfs/transaction.c |2 +-
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c90be01..b602611 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1427,6 +1427,13 @@ struct btrfs_fs_info {
 struct mutex ordered_extent_flush_mutex;

 struct rw_semaphore extent_commit_sem;
+/* notifies the readers to give up on the sem ASAP */
+atomic_t extent_commit_sem_give_up_read;
+#define BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info)  \
+do { atomic_inc(&(fs_info)->extent_commit_sem_give_up_read); \
+ down_write(&(fs_info)->extent_commit_sem);  \
+ atomic_dec(&(fs_info)->extent_commit_sem_give_up_read); \
+} while (0)

 struct rw_semaphore cleanup_work_sem;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 69e9afb..b88e688 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2291,6 +2291,7 @@ int open_ctree(struct super_block *sb,
 mutex_init(&fs_info->cleaner_mutex);
 mutex_init(&fs_info->volume_mutex);
 init_rwsem(&fs_info->extent_commit_sem);
+atomic_set(&fs_info->extent_commit_sem_give_up_read, 0);
 init_rwsem(&fs_info->cleanup_work_sem);
 init_rwsem(&fs_info->subvol_sem);
 sema_init(&fs_info->uuid_tree_rescan_sem, 1);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 95c6539..28fee78 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -442,7 +442,8 @@ next:
 if (ret)
 break;

-if (need_resched()) {
+if (need_resched() ||
+atomic_read(&fs_info->extent_commit_sem_give_up_read) > 0) {
 caching_ctl->progress = last;
 btrfs_release_path(path);
 up_read(&fs_info->extent_commit_sem);
@@ -632,7 +633,7 @@ static int cache_block_group(struct
btrfs_block_group_cache *cache,
 return 0;
 }

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);
 atomic_inc(&caching_ctl->count);
 list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
 up_write(&fs_info->extent_commit_sem);
@@ -5462,7 +5463,7 @@ void btrfs_prepare_extent_commit(struct
btrfs_trans_handle *trans,
 struct btrfs_block_group_cache *cache;
 struct btrfs_space_info *space_info;

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);

 list_for_each_entry_safe(caching_ctl, next,
  &fs_info->caching_block_groups, list) {
@@ -8219,7 +8220,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 struct btrfs_caching_control *caching_ctl;
 struct rb_node *n;

-down_write(&info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(info);
 while (!list_empty(&info->caching_block_groups)) {
 caching_ctl = list_entry(info->caching_block_groups.next,
  struct btrfs_caching_control, list);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index cac4a3f..976d20a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -969,7 +969,7 @@ static noinline int commit_cowonly_roots(struct
btrfs_trans_handle *trans,
 return ret;
 }

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);
 switch_commit_root(fs_info->extent_root);
 up_write(&fs_info->extent_commit_sem);

-- 
1.7.9.5


Re: [PATCH] Btrfs: handle errors when doing slow caching

2013-08-27 Thread Alex Lyakas
Hi Josef,
thanks for addressing this.

On Mon, Aug 5, 2013 at 6:19 PM, Josef Bacik  wrote:
> Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait
> forever if a drive failed.  This is because we just bail out if there is an
> error while trying to cache a block group, we don't update anybody who may be
> waiting.  So this introduces a new enum for the cache state in case of error 
> and
> makes everybody bail out if we have an error.  Alex tested and verified this
> patch fixed his problem.  This fixes bz 59431.  Thanks,
>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |1 +
>  fs/btrfs/extent-tree.c |   27 ---
>  2 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index cbb1263..c17acbc 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1188,6 +1188,7 @@ enum btrfs_caching_type {
> BTRFS_CACHE_STARTED = 1,
> BTRFS_CACHE_FAST= 2,
> BTRFS_CACHE_FINISHED= 3,
> +   BTRFS_CACHE_ERROR   = 4,
>  };
>
>  enum btrfs_disk_cache_state {
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e868c35..e6dfa7f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -113,7 +113,8 @@ static noinline int
>  block_group_cache_done(struct btrfs_block_group_cache *cache)
>  {
> smp_mb();
> -   return cache->cached == BTRFS_CACHE_FINISHED;
> +   return cache->cached == BTRFS_CACHE_FINISHED ||
> +   cache->cached == BTRFS_CACHE_ERROR;
>  }
>
>  static int block_group_bits(struct btrfs_block_group_cache *cache, u64 bits)
> @@ -389,7 +390,7 @@ static noinline void caching_thread(struct btrfs_work 
> *work)
> u64 total_found = 0;
> u64 last = 0;
> u32 nritems;
> -   int ret = 0;
> +   int ret = -ENOMEM;
>
> caching_ctl = container_of(work, struct btrfs_caching_control, work);
> block_group = caching_ctl->block_group;
> @@ -517,6 +518,12 @@ err:
>
> mutex_unlock(&caching_ctl->mutex);
>  out:
> +   if (ret) {
> +   spin_lock(&block_group->lock);
> +   block_group->caching_ctl = NULL;
> +   block_group->cached = BTRFS_CACHE_ERROR;
> +   spin_unlock(&block_group->lock);
> +   }
> wake_up(&caching_ctl->wait);
>
> put_caching_control(caching_ctl);
> @@ -6035,8 +6042,11 @@ static u64 stripe_align(struct btrfs_root *root,
>   * for our min num_bytes.  Another option is to have it go ahead
>   * and look in the rbtree for a free extent of a given size, but this
>   * is a good start.
> + *
> + * Callers of this must check if cache->cached == BTRFS_CACHE_ERROR before 
> using
> + * any of the information in this block group.
>   */
> -static noinline int
> +static noinline void
>  wait_block_group_cache_progress(struct btrfs_block_group_cache *cache,
> u64 num_bytes)
>  {
> @@ -6044,28 +6054,29 @@ wait_block_group_cache_progress(struct 
> btrfs_block_group_cache *cache,
>
> caching_ctl = get_caching_control(cache);
> if (!caching_ctl)
> -   return 0;
> +   return;
>
> wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
>(cache->free_space_ctl->free_space >= num_bytes));
>
> put_caching_control(caching_ctl);
> -   return 0;
>  }
>
>  static noinline int
>  wait_block_group_cache_done(struct btrfs_block_group_cache *cache)
>  {
> struct btrfs_caching_control *caching_ctl;
> +   int ret = 0;
>
> caching_ctl = get_caching_control(cache);
> if (!caching_ctl)
> return 0;
In case caching_thread completes with error for this block group,
get_caching_control() will return NULL.
So this function will return success, although the block group was not
cached properly.
Currently only btrfs_trim_fs() caller checks the return value of this
function, although you didn't post the btrfs_trim_fs() change in this
patch (but you posted it in the bugzilla). Still, should we check the
cache->cached for ERROR even if there is no caching control?
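
I.e., something like this (just a sketch of what I mean):

	caching_ctl = get_caching_control(cache);
	if (!caching_ctl)
		/* caching may already have failed; don't report success */
		return (cache->cached == BTRFS_CACHE_ERROR) ? -EIO : 0;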


>
> wait_event(caching_ctl->wait, block_group_cache_done(cache));
> -
> +   if (cache->cached == BTRFS_CACHE_ERROR)
> +   ret = -EIO;
> put_caching_control(caching_ctl);
> -   return 0;
> +   return ret;
>  }
>
>  int __get_raid_index(u64 flags)
> @@ -6248,6 +6259,8 @@ have_block_group:
> ret = 0;
>  

btrfs:async-thread: atomic_start_pending=1 is set, but it's too late

2013-08-26 Thread Alex Lyakas
Greetings all,
I see the following issue with spawning new threads for btrfs_workers
that have atomic_worker_start set:

# I have BTRFS that has 24Gb of total metadata, out of which extent
tree takes 11Gb; space_cache option is not used.
# After mounting, cache_block_group() triggers ~250 work items to
cache-in the needed block groups.
# At this point, fs_info->caching_workers has one thread, which is
considered "idle".
# Work items start to accumulate on this thread's "pending" list, until
the thread is considered "busy".
# Now workers->atomic_worker_start is set, but
check_pending_worker_creates() has not run yet (it is called only from
worker_loop), so the same single thread is picked as "fallback".

The problem is that this thread is still running the "caching_thread"
function, scanning for EXTENT_ITEMs of the first block-group. This
takes 3-4seconds for 1Gb block group.

# Once caching_thread() exits, check_pending_worker_creates() is
called, and creates the second thread, but it's too late, because all
the 250 work items are already sitting in the first thread's "pending"
list. So the  second thread doesn't help at all.

As a result, all block-group caching is performed by the same thread,
which, due to one-by-one scanning of EXTENT_ITEMs, takes forever for
this BTRFS.

How can this be fixed?
- can check_pending_worker_creates() perhaps be called outside of
worker_loop, e.g., by find_worker(), instead of just setting
workers->atomic_start_pending?
- maybe for fs_info->caching_workers we don't have to create new
workers asynchronously, so we can pass NULL for async_helper in
btrfs_init_workers()? (probably we have to, just checking; see the
sketch after this list)
- any other way?
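
For the second option, a minimal sketch of the change I have in mind
(assuming the 3.x-era btrfs_init_workers() signature, where the last
argument is the optional async helper used for spawning new workers):

	/* in open_ctree(), fs/btrfs/disk-io.c -- sketch only, untested */
	btrfs_init_workers(&fs_info->caching_workers, "cache", 2,
			   NULL /* no async helper: new workers are started
				   synchronously from find_worker() */);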

Thanks,
Alex.


Re: [PATCH] Btrfs: stop all workers before cleaning up roots

2013-08-01 Thread Alex Lyakas
Hi Josef,

On Thu, May 30, 2013 at 11:58 PM, Josef Bacik  wrote:
> Dave reported a panic because the extent_root->commit_root was NULL in the
> caching kthread.  That is because we just unset it in free_root_pointers, 
> which
> is not the correct thing to do, we have to either wait for the caching kthread
> to complete or hold the extent_commit_sem lock so we know the thread has 
> exited.
> This patch makes the kthreads all stop first and then we do our cleanup.  This
> should fix the race.  Thanks,
>
> Reported-by: David Sterba 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/disk-io.c |6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2b53afd..77cb566 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3547,13 +3547,13 @@ int close_ctree(struct btrfs_root *root)
>
> btrfs_free_block_groups(fs_info);

do you think it would be safer to stop all workers first and make sure
they are stopped, then do btrfs_free_block_groups()? I see, for
example, that btrfs_free_block_groups() checks:
if (block_group->cached == BTRFS_CACHE_STARTED)
which could perhaps be racy with other threads spawning caching_threads.

So maybe it would be better to stop all threads (including the cleaner and
the committer) and then free everything?

>
> -   free_root_pointers(fs_info, 1);
> +   btrfs_stop_all_workers(fs_info);
>
> del_fs_roots(fs_info);
>
> -   iput(fs_info->btree_inode);
> +   free_root_pointers(fs_info, 1);
>
> -   btrfs_stop_all_workers(fs_info);
> +   iput(fs_info->btree_inode);
>
>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
> if (btrfs_test_opt(root, CHECK_INTEGRITY))
> --
> 1.7.7.6
>

Alex.


Re: [PATCH] Btrfs: fix all callers of read_tree_block

2013-07-30 Thread Alex Lyakas
Hi Josef,

On Tue, Apr 23, 2013 at 9:20 PM, Josef Bacik  wrote:
> We kept leaking extent buffers when mounting a broken file system and it turns
> out it's because not everybody uses read_tree_block properly.  You need to 
> check
> and make sure the extent_buffer is uptodate before you use it.  This patch 
> fixes
> everybody who calls read_tree_block directly to make sure they check that it 
> is
> uptodate and free it and return an error if it is not.  With this we no longer
> leak EB's when things go horribly wrong.  Thanks,
>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/backref.c |   10 --
>  fs/btrfs/ctree.c   |   21 -
>  fs/btrfs/disk-io.c |   19 +--
>  fs/btrfs/extent-tree.c |4 +++-
>  fs/btrfs/relocation.c  |   18 +++---
>  5 files changed, 59 insertions(+), 13 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 23e927b..04b5b30 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -423,7 +423,10 @@ static int __add_missing_keys(struct btrfs_fs_info 
> *fs_info,
> BUG_ON(!ref->wanted_disk_byte);
> eb = read_tree_block(fs_info->tree_root, 
> ref->wanted_disk_byte,
>  fs_info->tree_root->leafsize, 0);
> -   BUG_ON(!eb);
> +   if (!eb || !extent_buffer_uptodate(eb)) {
> +   free_extent_buffer(eb);
> +   return -EIO;
> +   }
> btrfs_tree_read_lock(eb);
> if (btrfs_header_level(eb) == 0)
> btrfs_item_key_to_cpu(eb, &ref->key_for_search, 0);
> @@ -913,7 +916,10 @@ again:
> info_level);
> eb = read_tree_block(fs_info->extent_root,
>ref->parent, bsz, 
> 0);
> -   BUG_ON(!eb);
> +   if (!eb || !extent_buffer_uptodate(eb)) {
> +   free_extent_buffer(eb);
> +   return -EIO;
> +   }
> ret = find_extent_in_eb(eb, bytenr,
> *extent_item_pos, 
> &eie);
> ref->inode_list = eie;
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 566d99b..2bc3440 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1281,7 +1281,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
> free_extent_buffer(eb_root);
> blocksize = btrfs_level_size(root, old_root->level);
> old = read_tree_block(root, logical, blocksize, 0);
> -   if (!old) {
> +   if (!old || !extent_buffer_uptodate(old)) {
> +   free_extent_buffer(old);
> pr_warn("btrfs: failed to read tree block %llu from 
> get_old_root\n",
> logical);
> WARN_ON(1);
> @@ -1526,8 +1527,10 @@ int btrfs_realloc_node(struct btrfs_trans_handle 
> *trans,
> if (!cur) {
> cur = read_tree_block(root, blocknr,
>  blocksize, gen);
> -   if (!cur)
> +   if (!cur || !extent_buffer_uptodate(cur)) {
> +   free_extent_buffer(cur);
> return -EIO;
> +   }
> } else if (!uptodate) {
> err = btrfs_read_buffer(cur, gen);
> if (err) {
> @@ -1692,6 +1695,8 @@ static noinline struct extent_buffer 
> *read_node_slot(struct btrfs_root *root,
>struct extent_buffer *parent, int slot)
>  {
> int level = btrfs_header_level(parent);
> +   struct extent_buffer *eb;
> +
> if (slot < 0)
> return NULL;
> if (slot >= btrfs_header_nritems(parent))
> @@ -1699,9 +1704,15 @@ static noinline struct extent_buffer 
> *read_node_slot(struct btrfs_root *root,
>
> BUG_ON(level == 0);
>
> -   return read_tree_block(root, btrfs_node_blockptr(parent, slot),
> -  btrfs_level_size(root, level - 1),
> -  btrfs_node_ptr_generation(parent, slot));
> +   eb = read_tree_block(root, btrfs_node_blockptr(parent, slot),
> +btrfs_level_size(root, level - 1),
> +btrfs_node_ptr_generation(parent, slot));
> +   if (eb && !extent_buffer_uptodate(eb)) {
> +   free_extent_buffer(eb);
> +   eb = NULL;
> +   }
> +
> +   return eb;
>  }
>
>  /*
> diff --git a/fs/bt

Re: [PATCH] Btrfs: update drop progress before stopping snapshot dropping

2013-07-30 Thread Alex Lyakas
Thanks for posting that patch, Josef.

On Mon, Jul 15, 2013 at 6:59 PM, Josef Bacik  wrote:
>
> Alex pointed out a problem and fix that exists in the drop one snapshot at
> a
> time patch.  If we decide we need to exit for whatever reason (umount for
> example) we will just exit the snapshot dropping without updating the drop
> progress.  So the next time we go to resume we will BUG_ON() because we
> can't
> find the extent we left off at because we never updated it.  This patch
> fixes
> the problem.
>
> Cc: sta...@vger.kernel.org
> Reported-by: Alex Lyakas 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c |   14 --
>  1 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index bc00b24..8c204e1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7584,11 +7584,6 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>
> while (1) {
> -   if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
> -   pr_debug("btrfs: drop snapshot early exit\n");
> -   err = -EAGAIN;
> -   goto out_end_trans;
> -   }
>
> ret = walk_down_tree(trans, root, path, wc);
> if (ret < 0) {
> @@ -7616,7 +7611,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> }
>
> BUG_ON(wc->level == 0);
> -   if (btrfs_should_end_transaction(trans, tree_root)) {
> +   if (btrfs_should_end_transaction(trans, tree_root) ||
> +   (!for_reloc && btrfs_need_cleaner_sleep(root))) {
> ret = btrfs_update_root(trans, tree_root,
> &root->root_key,
> root_item);
> @@ -7627,6 +7623,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> }
>
> btrfs_end_transaction_throttle(trans, tree_root);
> +   if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
> +   pr_debug("btrfs: drop snapshot early exit\n");
> +   err = -EAGAIN;
> +   goto out_free;
> +   }
> +
> trans = btrfs_start_transaction(tree_root, 0);
> if (IS_ERR(trans)) {
> err = PTR_ERR(trans);
> --
> 1.7.7.6
>


Re: [PATCH] Btrfs: fix lock leak when resuming snapshot deletion

2013-07-16 Thread Alex Lyakas
On Mon, Jul 15, 2013 at 7:43 PM, Josef Bacik  wrote:
> We aren't setting path->locks[level] when we resume a snapshot deletion which
> means we won't unlock the buffer when we free the path.  This causes deadlocks
> if we happen to re-allocate the block before we've evicted the extent buffer
> from cache.  Thanks,
>
> Reported-by: Alex Lyakas 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8c204e1..997a5dd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7555,6 +7555,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> while (1) {
> btrfs_tree_lock(path->nodes[level]);
> btrfs_set_lock_blocking(path->nodes[level]);
> +   path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
>
> ret = btrfs_lookup_extent_info(trans, root,
> path->nodes[level]->start,
> @@ -7570,6 +7571,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> break;
>
> btrfs_tree_unlock(path->nodes[level]);
> +   path->locks[level] = 0;
> WARN_ON(wc->refs[level] != 1);
> level--;
> }
> --
> 1.7.7.6
>
> --

Tested-by: Liran Strugano 



Re: [PATCH v3] btrfs: clean snapshots one by one

2013-07-14 Thread Alex Lyakas
Hi,

On Thu, Jul 4, 2013 at 10:52 PM, Alex Lyakas
 wrote:
> Hi David,
>
> On Thu, Jul 4, 2013 at 8:03 PM, David Sterba  wrote:
>> On Thu, Jul 04, 2013 at 06:29:23PM +0300, Alex Lyakas wrote:
>>> > @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>>> > wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>>> >
>>> > while (1) {
>>> > +   if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
>>> > +   pr_debug("btrfs: drop snapshot early exit\n");
>>> > +   err = -EAGAIN;
>>> > +   goto out_end_trans;
>>> > +   }
>>> Here you exit the loop, but the "drop_progress" in the root item is
>>> incorrect. When the system is remounted, and snapshot deletion
>>> resumes, it seems that it tries to resume from the EXTENT_ITEM that
>>> does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
>>> simply does not find the needed extent.
>>> So then I hit panic in walk_down_tree():
>>> BUG: wc->refs[level - 1] == 0
>>>
>>> I fixed it like follows:
>>> There is a place where btrfs_drop_snapshot() checks if it needs to
>>> detach from transaction and re-attach. So I moved the exit point there
>>> and the code is like this:
>>>
>>>   if (btrfs_should_end_transaction(trans, tree_root) ||
>>>   (!for_reloc && btrfs_need_cleaner_sleep(root))) {
>>>   ret = btrfs_update_root(trans, tree_root,
>>>   &root->root_key,
>>>   root_item);
>>>   if (ret) {
>>>   btrfs_abort_transaction(trans, tree_root, 
>>> ret);
>>>   err = ret;
>>>   goto out_end_trans;
>>>   }
>>>
>>>   btrfs_end_transaction_throttle(trans, tree_root);
>>>   if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
>>>   err = -EAGAIN;
>>>   goto out_free;
>>>   }
>>>   trans = btrfs_start_transaction(tree_root, 0);
>>> ...
>>>
>>> With this fix, I do not hit the panic, and snapshot deletion proceeds
>>> and completes alright after mount.
>>>
>>> Do you agree to my analysis or I am missing something? It seems that
>>> Josef's btrfs-next still has this issue (as does Chris's for-linus).
>>
>> Sound analysis and I agree with the fix. The clean-by-one patch has been
>> merged into 3.10 so we need a stable fix for that.
> Thanks for confirming, David!
>
> From more testing, I have two more notes:
>
> # After applying the fix, whenever snapshot deletion is resumed after
> mount, and successfully completes, then I unmount again, and rmmod
> btrfs, linux complains about loosing few "struct extent_buffer" during
> kem_cache_delete().
> So somewhere on that path:
> if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
> ...
> } else {
> ===> HERE
>
> and later we perhaps somehow overwrite the contents of "struct
> btrfs_path" that is used in the whole function. Because at the end of
> the function we always do btrfs_free_path(), which inside does
> btrfs_release_path().  I was not able to determine where the leak
> happens, do you have any hint? No other activity happens in the system
> except the resumed snap deletion, and this problem only happens when
> resuming.
>
I found where the memory leak happens. When we abort snapshot deletion
in the middle, then this btrfs_root is basically left alone hanging in
the air. It is out of the "dead_roots" already, so when del_fs_roots()
is called during unmount, it will not free this root and its
root->node (which is the one that triggers memory leak warning on
kmem_cache_destroy) and perhaps other stuff too. The issue still
exists in btrfs-next.

Simplest fix I came up with was:

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d275681..52a2c54 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7468,6 +7468,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
int err = 0;
int ret;
int level;
+   bool root_freed = false;

path = btrfs_alloc_path();
if (!path) {
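
The diff above got truncated; the rest of the idea is to mark root_freed
once the root has actually been handed off for freeing, and on the early
-EAGAIN exit put the root back on fs_info->dead_roots so that the cleaner
(or unmount) can still free it. Roughly (just a sketch of the direction,
not the exact patch):

	/* sketch: at the end of btrfs_drop_snapshot() */
out:
	/*
	 * If we bailed out early (err == -EAGAIN), the root is already off
	 * fs_info->dead_roots, so nobody would ever free it or its
	 * root->node. Re-add it so it gets picked up again.
	 */
	if (!for_reloc && !root_freed)
		btrfs_add_dead_root(root);
	return err;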

Re: [PATCH v3] btrfs: clean snapshots one by one

2013-07-04 Thread Alex Lyakas
Hi David,

On Thu, Jul 4, 2013 at 8:03 PM, David Sterba  wrote:
> On Thu, Jul 04, 2013 at 06:29:23PM +0300, Alex Lyakas wrote:
>> > @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>> > wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>> >
>> > while (1) {
>> > +   if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
>> > +   pr_debug("btrfs: drop snapshot early exit\n");
>> > +   err = -EAGAIN;
>> > +   goto out_end_trans;
>> > +   }
>> Here you exit the loop, but the "drop_progress" in the root item is
>> incorrect. When the system is remounted, and snapshot deletion
>> resumes, it seems that it tries to resume from the EXTENT_ITEM that
>> does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
>> simply does not find the needed extent.
>> So then I hit panic in walk_down_tree():
>> BUG: wc->refs[level - 1] == 0
>>
>> I fixed it like follows:
>> There is a place where btrfs_drop_snapshot() checks if it needs to
>> detach from transaction and re-attach. So I moved the exit point there
>> and the code is like this:
>>
>>   if (btrfs_should_end_transaction(trans, tree_root) ||
>>   (!for_reloc && btrfs_need_cleaner_sleep(root))) {
>>   ret = btrfs_update_root(trans, tree_root,
>>   &root->root_key,
>>   root_item);
>>   if (ret) {
>>   btrfs_abort_transaction(trans, tree_root, ret);
>>   err = ret;
>>   goto out_end_trans;
>>   }
>>
>>   btrfs_end_transaction_throttle(trans, tree_root);
>>   if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
>>   err = -EAGAIN;
>>   goto out_free;
>>   }
>>   trans = btrfs_start_transaction(tree_root, 0);
>> ...
>>
>> With this fix, I do not hit the panic, and snapshot deletion proceeds
>> and completes alright after mount.
>>
>> Do you agree to my analysis or I am missing something? It seems that
>> Josef's btrfs-next still has this issue (as does Chris's for-linus).
>
> Sound analysis and I agree with the fix. The clean-by-one patch has been
> merged into 3.10 so we need a stable fix for that.
Thanks for confirming, David!

From more testing, I have two more notes:

# After applying the fix, whenever snapshot deletion is resumed after
mount, and successfully completes, then I unmount again, and rmmod
btrfs, Linux complains about losing a few "struct extent_buffer" objects
during kmem_cache_destroy().
So somewhere on that path:
if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
...
} else {
===> HERE

and later we perhaps somehow overwrite the contents of "struct
btrfs_path" that is used in the whole function. Because at the end of
the function we always do btrfs_free_path(), which inside does
btrfs_release_path().  I was not able to determine where the leak
happens; do you have any hint? No other activity happens in the system
except the resumed snap deletion, and this problem only happens when
resuming.

# This is for Josef: after I unmount the fs with ongoing snap deletion
(after applying my fix), and run the latest btrfsck - it complains a
lot about problems in extent tree:( But after I mount again, snap
deletion resumes then completes, then I unmount and btrfsck is happy
again. So probably it does not account orphan roots properly?

David, will you provide a fixed patch, if possible?

Thanks!
Alex.

>
> thanks,
> david


Re: [PATCH v3] btrfs: clean snapshots one by one

2013-07-04 Thread Alex Lyakas
Hi David,
I believe this patch has the following problem:

On Tue, Mar 12, 2013 at 5:13 PM, David Sterba  wrote:
> Each time pick one dead root from the list and let the caller know if
> it's needed to continue. This should improve responsiveness during
> umount and balance which at some point waits for cleaning all currently
> queued dead roots.
>
> A new dead root is added to the end of the list, so the snapshots
> disappear in the order of deletion.
>
> The snapshot cleaning work is now done only from the cleaner thread and the
> others wake it if needed.
>
> Signed-off-by: David Sterba 
> ---
>
> v1,v2:
> * http://thread.gmane.org/gmane.comp.file-systems.btrfs/23212
>
> v2->v3:
> * remove run_again from btrfs_clean_one_deleted_snapshot and return 1
>   unconditionally
>
>  fs/btrfs/disk-io.c |   10 ++--
>  fs/btrfs/extent-tree.c |8 ++
>  fs/btrfs/relocation.c  |3 --
>  fs/btrfs/transaction.c |   56 +++
>  fs/btrfs/transaction.h |2 +-
>  5 files changed, 53 insertions(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 988b860..4de2351 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1690,15 +1690,19 @@ static int cleaner_kthread(void *arg)
> struct btrfs_root *root = arg;
>
> do {
> +   int again = 0;
> +
> if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
> +   down_read_trylock(&root->fs_info->sb->s_umount) &&
> mutex_trylock(&root->fs_info->cleaner_mutex)) {
> btrfs_run_delayed_iputs(root);
> -   btrfs_clean_old_snapshots(root);
> +   again = btrfs_clean_one_deleted_snapshot(root);
> mutex_unlock(&root->fs_info->cleaner_mutex);
> btrfs_run_defrag_inodes(root->fs_info);
> +   up_read(&root->fs_info->sb->s_umount);
> }
>
> -   if (!try_to_freeze()) {
> +   if (!try_to_freeze() && !again) {
> set_current_state(TASK_INTERRUPTIBLE);
> if (!kthread_should_stop())
> schedule();
> @@ -3403,8 +3407,8 @@ int btrfs_commit_super(struct btrfs_root *root)
>
> mutex_lock(&root->fs_info->cleaner_mutex);
> btrfs_run_delayed_iputs(root);
> -   btrfs_clean_old_snapshots(root);
> mutex_unlock(&root->fs_info->cleaner_mutex);
> +   wake_up_process(root->fs_info->cleaner_kthread);
>
> /* wait until ongoing cleanup work done */
> down_write(&root->fs_info->cleanup_work_sem);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 742b7a7..a08d0fe 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7263,6 +7263,8 @@ static noinline int walk_up_tree(struct 
> btrfs_trans_handle *trans,
>   * reference count by one. if update_ref is true, this function
>   * also make sure backrefs for the shared block and all lower level
>   * blocks are properly updated.
> + *
> + * If called with for_reloc == 0, may exit early with -EAGAIN
>   */
>  int btrfs_drop_snapshot(struct btrfs_root *root,
>  struct btrfs_block_rsv *block_rsv, int update_ref,
> @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>
> while (1) {
> +   if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
> +   pr_debug("btrfs: drop snapshot early exit\n");
> +   err = -EAGAIN;
> +   goto out_end_trans;
> +   }
Here you exit the loop, but the "drop_progress" in the root item is
incorrect. When the system is remounted, and snapshot deletion
resumes, it seems that it tries to resume from the EXTENT_ITEM that
does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
simply does not find the needed extent.
So then I hit panic in walk_down_tree():
BUG: wc->refs[level - 1] == 0

I fixed it as follows:
There is a place where btrfs_drop_snapshot() checks if it needs to
detach from transaction and re-attach. So I moved the exit point there
and the code is like this:

if (btrfs_should_end_transaction(trans, tree_root) ||
(!for_reloc && btrfs_need_cleaner_sleep(root))) {
ret = btrfs_update_root(trans, tree_root,
&root->root_key,
root_item);
if (ret) {
btrfs_abort_transaction(trans, tree_root, ret);
err = ret;
goto out_end_trans;
}

btrfs_end_transaction_throttle(trans, tree_root);
if (!for_reloc && btr

Re: question about transaction-abort-related commits

2013-07-02 Thread Alex Lyakas
On Sun, Jun 30, 2013 at 2:36 PM, Josef Bacik  wrote:
> On Sun, Jun 30, 2013 at 11:29:14AM +0300, Alex Lyakas wrote:
>> Hi Josef,
>>
>> On Wed, Jun 26, 2013 at 5:16 PM, Alex Lyakas
>>  wrote:
>> > Hi Josef,
>> > Can you please help me with another question.
>> >
>> > I am looking at your patch:
>> > Btrfs: fix chunk allocation error handling
>> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0448748849ef7c593be40e2c1404f7974bd3aac6
>> >
>> > Here you changed the order of btrfs_make_block_group() vs
>> > btrfs_alloc_dev_extent(), because we could have allocated from the
>> > in-memory block group, before we have inserted the dev extent into a
>> > tree. However, with this fix, I hit the deadlock[1] of
>> > btrfs_alloc_dev_extent() that also wants to allocate a chunk and
>> > recursively calls do_chunk_alloc, but then is stuck on chunk_mutex.
>> >
>> > Was this patch:
>> > Btrfs: don't re-enter when allocating a chunk
>> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c6b305a89b1903d63652691ad5eb9f05aa0326b8
>> > introduced to fix this deadlock?
>>
>> With these two patches ("Btrfs: fix chunk allocation error handling"
>> and "Btrfs: don't re-enter when allocating a chunk"), I am hitting
>> ENOSPC during metadata chunk allocation.
>>
>> Upon entry into "do_chunk_alloc", I have only one METADATA block-group
>> as follows:
>> total_bytes=8388608
>> bytes_used=7938048
>> bytes_pinned=446464
>> bytes_reserved=4096
>> bytes_readonly=0
>> bytes_may_use=3362816
>>
>> As we see bytes_used+bytes_pinned+bytes_reserved==total_bytes
>>
>> What happens next is that within __btrfs_alloc_chunk():
>> - find_free_dev_extent() finds a free extent (metadata policy is SINGLE)
>> - btrfs_alloc_dev_extent() fails with ENOSPC
>>
>> (btrfs_make_block_group() is called after btrfs_alloc_dev_extent()
>> with these patches).
>>
>> What should be done in such situation, when there is not enough
>> METADATA to allocate a device extent item, but we still don't allow
>> allocating from the newly-created METADATA block group?
>>
>
> So I had a third patch that you are likely missing that makes sure we try and
> allocate chunks sooner specifically for this case
>
> 96f1bb57771f71bf1d55d5031a1cf47908494330
>
> and then Miao made it better I think with this
>
> 3c76cd84e0c0d3ceb094a1020f8c55c2417e18d3
>
> Thanks,
>
> Josef

Thank you Josef, I didn't realize that.

Alex.


Re: question about transaction-abort-related commits

2013-06-30 Thread Alex Lyakas
Hi Josef,

On Wed, Jun 26, 2013 at 5:16 PM, Alex Lyakas
 wrote:
> Hi Josef,
> Can you please help me with another question.
>
> I am looking at your patch:
> Btrfs: fix chunk allocation error handling
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0448748849ef7c593be40e2c1404f7974bd3aac6
>
> Here you changed the order of btrfs_make_block_group() vs
> btrfs_alloc_dev_extent(), because we could have allocated from the
> in-memory block group, before we have inserted the dev extent into a
> tree. However, with this fix, I hit the deadlock[1] of
> btrfs_alloc_dev_extent() that also wants to allocate a chunk and
> recursively calls do_chunk_alloc, but then is stuck on chunk_mutex.
>
> Was this patch:
> Btrfs: don't re-enter when allocating a chunk
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c6b305a89b1903d63652691ad5eb9f05aa0326b8
> introduced to fix this deadlock?

With these two patches ("Btrfs: fix chunk allocation error handling"
and "Btrfs: don't re-enter when allocating a chunk"), I am hitting
ENOSPC during metadata chunk allocation.

Upon entry into "do_chunk_alloc", I have only one METADATA block-group
as follows:
total_bytes=8388608
bytes_used=7938048
bytes_pinned=446464
bytes_reserved=4096
bytes_readonly=0
bytes_may_use=3362816

As we see bytes_used+bytes_pinned+bytes_reserved==total_bytes

What happens next is that within __btrfs_alloc_chunk():
- find_free_dev_extent() finds a free extent (metadata policy is SINGLE)
- btrfs_alloc_dev_extent() fails with ENOSPC

(btrfs_make_block_group() is called after btrfs_alloc_dev_extent()
with these patches).

What should be done in such a situation, when there is not enough
METADATA space to allocate a device extent item, but we still don't allow
allocating from the newly-created METADATA block group?

Thanks,
Alex.




>
> Thanks,
> Alex.
>
> [1]
> [] do_chunk_alloc+0x8d/0x510 [btrfs]
> [] find_free_extent+0x9cd/0xb90 [btrfs]
> [] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
> [] btrfs_alloc_free_block+0xf9/0x570 [btrfs]
> [] __btrfs_cow_block+0x126/0x500 [btrfs]
> [] btrfs_cow_block+0x17a/0x230 [btrfs]
> [] btrfs_search_slot+0x381/0x820 [btrfs]
> [] btrfs_insert_empty_items+0x7c/0x120 [btrfs]
> [] btrfs_alloc_dev_extent+0x9b/0x1c0 [btrfs]
> [] __btrfs_alloc_chunk+0x58a/0x850 [btrfs]
> [] btrfs_alloc_chunk+0xbf/0x160 [btrfs]
> [] do_chunk_alloc+0x32b/0x510 [btrfs]
> [] find_free_extent+0x9cd/0xb90 [btrfs]
> [] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
> [] btrfs_alloc_free_block+0xf9/0x570 [btrfs]
> [] __btrfs_cow_block+0x126/0x500 [btrfs]
> [] btrfs_cow_block+0x17a/0x230 [btrfs]
> [] push_leaf_right+0x133/0x1a0 [btrfs]
> [] split_leaf+0x5e1/0x770 [btrfs]
> [] btrfs_search_slot+0x785/0x820 [btrfs]
> [] lookup_inline_extent_backref+0x8e/0x5b0 [btrfs]
> [] insert_inline_extent_backref+0x63/0x130 [btrfs]
> [] __btrfs_inc_extent_ref+0x9f/0x240 [btrfs]
> [] run_clustered_refs+0x971/0xd00 [btrfs]
> [] btrfs_run_delayed_refs+0xd0/0x330 [btrfs]
> [] __btrfs_end_transaction+0xf7/0x440 [btrfs]
> [] btrfs_end_transaction+0x10/0x20 [btrfs]
>
>
>
>
> On Mon, Jun 24, 2013 at 9:56 PM, Alex Lyakas
>  wrote:
>>
>> Thanks for commenting Josef. I hope your head will get better:)
>> Actually, while re-looking at the code, I see that there are couple of
>> "goto cleanup;", before we ensure that all the writers have detached
>> from the committing transaction. So Liu's code is still needed, looks
>> like.
>>
>> Thanks,
>> Alex.
>>
>> On Mon, Jun 24, 2013 at 7:24 PM, Josef Bacik  wrote:
>> > On Sun, Jun 23, 2013 at 09:52:14PM +0300, Alex Lyakas wrote:
>> >> Hello Josef, Liu,
>> >> I am reviewing commits in the mainline tree:
>> >>
>> >> e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't
>> >> committing just end the transaction if we error out
>> >> (call end_transaction() instead of goto cleanup_transaction) - Josef
>> >>
>> >> f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after
>> >> aborting a transaction
>> >> (wait until all writers detach, before setting running_transaction to
>> >> NULL) - Liu
>> >>
>> >> 66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on
>> >> transaction waiting list
>> >> (check if transaction was already removed from the transactions list) -
>> >> Liu
>> >>
>> >> Josef, according to your fix, if the committer encounters a problem
>> >> early, it just doesn't commit. Instead it aborts the transaction
>> >> (possibly setti

Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-06-26 Thread Alex Lyakas
Hi Miao,

On Mon, Jun 17, 2013 at 4:51 AM, Miao Xie  wrote:
> On  sun, 16 Jun 2013 13:38:42 +0300, Alex Lyakas wrote:
>> Hi Miao,
>>
>> On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie  wrote:
>>> On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
>>>> I reviewed the code starting from:
>>>> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
>>>> the transaction commit
>>>> until
>>>> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
>>>>
>>>> It looks very good. Let me check if I understand the fix correctly:
>>>> # When transaction starts to commit, we want to wait only for external
>>>> writers (those that did ATTACH/START/USERSPACE)
>>>> # We guarantee at this point that no new external writers will hop on
>>>> the committing transaction, by setting ->blocked state, so we only
>>>> wait for existing extwriters to detach from transaction
>>
>> I have a doubt about this point with your new code. Example:
>> Task1 - external writer
>> Task2 - transaction kthread
>>
>> Task1                                               Task2
>> |start_transaction(TRANS_START)                     |
>> |-wait_current_trans(blocked=0, so it doesn't wait) |
>> |-join_transaction()                                |
>> |--lock(trans_lock)                                 |
>> |--can_join_transaction() YES                       |
>> |                                                   |-btrfs_commit_transaction()
>> |                                                   |--blocked=1
>> |                                                   |--in_commit=1
>> |                                                   |--wait_event(extwriter == 0);
>> |                                                   |--btrfs_flush_all_pending_stuffs()
>> |--extwriter_counter_inc()                          |
>> |--unlock(trans_lock)                               |
>> |                                                   | lock(trans_lock)
>> |                                                   | trans_no_join=1
>>
>> Basically, the "blocked/in_commit" check is not synchronized with
>> joining a transaction. After checking "blocked", the external writer
>> may proceed and join the transaction. Right before joining, it calls
>> can_join_transaction(). But this function checks in_commit flag under
>> fs_info->trans_lock. But btrfs_commit_transaction() sets this flag not
>> under trans_lock, but under commit_lock, so checking this flag is not
>> synchronized.
>>
>> Or maybe I am wrong, because btrfs_commit_transaction() locks and
>> unlocks trans_lock to check for previous transaction, so by accident
>> there is no problem, and above scenario cannot happen?
>
> Your analysis at the last section is right, so the right process is:
>
> Task1                                               Task2
> |start_transaction(TRANS_START)                     |
> |-wait_current_trans(blocked=0, so it doesn't wait) |
> |-join_transaction()                                |
> |--lock(trans_lock)                                 |
> |--can_join_transaction() YES                       |
> |                                                   |-btrfs_commit_transaction()
> |                                                   |--blocked=1
> |                                                   |--in_commit=1
> |--extwriter_counter_inc()                          |
> |--unlock(trans_lock)                               |
> |                                                   |--lock(trans_lock)
> |                                                   |--...
> |                                                   |--unlock(trans_lock)
> |                                                   |--...
> |                                                   |--wait_event(extwriter == 0);
> |                                                   |--btrfs_flush_all_pending_stuffs()
>
> The problem you worried can not happen.
>
> Anyway, it is not good that the "blocked/in_commit" check is not synchronized 
> with
> joining a transaction. So I modified the relative code in this patchset.
>

The four patches that we applied related to the extwriters issue work
very well. They definitely solve the non-deterministic behavior while
waiting for the writers to detach. Thanks for addressing this issue.
One note is that the new behavior is perhaps less "friendly" to the
transaction join flow. With your fix, the committer unconditionally
sets
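
For reference, the heart of the extwriters patchset is to count only
external writers separately, roughly like this (a sketch; in mainline,
TRANS_EXTWRITERS groups the TRANS_START, TRANS_ATTACH and
TRANS_USERSPACE types):

    #define TRANS_EXTWRITERS    (__TRANS_START | __TRANS_ATTACH | \
                                 __TRANS_USERSPACE)

    static inline void extwriter_counter_inc(struct btrfs_transaction *trans,
                                             unsigned int type)
    {
            /* internal joiners (TRANS_JOIN etc.) are not counted here,
             * so the commit only waits for real external writers */
            if (type & TRANS_EXTWRITERS)
                    atomic_inc(&trans->num_extwriters);
    }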

Re: question about transaction-abort-related commits

2013-06-26 Thread Alex Lyakas
Hi Josef,
Can you please help me with another question?

I am looking at your patch:
Btrfs: fix chunk allocation error handling
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0448748849ef7c593be40e2c1404f7974bd3aac6

Here you changed the order of btrfs_make_block_group() vs
btrfs_alloc_dev_extent(), because we could have allocated from the
in-memory block group, before we have inserted the dev extent into a
tree. However, with this fix, I hit the deadlock[1] of
btrfs_alloc_dev_extent() that also wants to allocate a chunk and
recursively calls do_chunk_alloc, but then is stuck on chunk_mutex.

Was this patch:
Btrfs: don't re-enter when allocating a chunk
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c6b305a89b1903d63652691ad5eb9f05aa0326b8
introduced to fix this deadlock?

Thanks,
Alex.

[1]
[] do_chunk_alloc+0x8d/0x510 [btrfs]
[] find_free_extent+0x9cd/0xb90 [btrfs]
[] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
[] btrfs_alloc_free_block+0xf9/0x570 [btrfs]
[] __btrfs_cow_block+0x126/0x500 [btrfs]
[] btrfs_cow_block+0x17a/0x230 [btrfs]
[] btrfs_search_slot+0x381/0x820 [btrfs]
[] btrfs_insert_empty_items+0x7c/0x120 [btrfs]
[] btrfs_alloc_dev_extent+0x9b/0x1c0 [btrfs]
[] __btrfs_alloc_chunk+0x58a/0x850 [btrfs]
[] btrfs_alloc_chunk+0xbf/0x160 [btrfs]
[] do_chunk_alloc+0x32b/0x510 [btrfs]
[] find_free_extent+0x9cd/0xb90 [btrfs]
[] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
[] btrfs_alloc_free_block+0xf9/0x570 [btrfs]
[] __btrfs_cow_block+0x126/0x500 [btrfs]
[] btrfs_cow_block+0x17a/0x230 [btrfs]
[] push_leaf_right+0x133/0x1a0 [btrfs]
[] split_leaf+0x5e1/0x770 [btrfs]
[] btrfs_search_slot+0x785/0x820 [btrfs]
[] lookup_inline_extent_backref+0x8e/0x5b0 [btrfs]
[] insert_inline_extent_backref+0x63/0x130 [btrfs]
[] __btrfs_inc_extent_ref+0x9f/0x240 [btrfs]
[] run_clustered_refs+0x971/0xd00 [btrfs]
[] btrfs_run_delayed_refs+0xd0/0x330 [btrfs]
[] __btrfs_end_transaction+0xf7/0x440 [btrfs]
[] btrfs_end_transaction+0x10/0x20 [btrfs]




On Mon, Jun 24, 2013 at 9:56 PM, Alex Lyakas
 wrote:
>
> Thanks for commenting Josef. I hope your head will get better:)
> Actually, while re-looking at the code, I see that there are couple of
> "goto cleanup;", before we ensure that all the writers have detached
> from the committing transaction. So Liu's code is still needed, looks
> like.
>
> Thanks,
> Alex.
>
> On Mon, Jun 24, 2013 at 7:24 PM, Josef Bacik  wrote:
> > On Sun, Jun 23, 2013 at 09:52:14PM +0300, Alex Lyakas wrote:
> >> Hello Josef, Liu,
> >> I am reviewing commits in the mainline tree:
> >>
> >> e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't
> >> committing just end the transaction if we error out
> >> (call end_transaction() instead of goto cleanup_transaction) - Josef
> >>
> >> f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after
> >> aborting a transaction
> >> (wait until all writers detach, before setting running_transaction to
> >> NULL) - Liu
> >>
> >> 66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on
> >> transaction waiting list
> >> (check if transaction was already removed from the transactions list) -
> >> Liu
> >>
> >> Josef, according to your fix, if the committer encounters a problem
> >> early, it just doesn't commit. Instead it aborts the transaction
> >> (possibly setting FS to read-only) and detaches from the transaction.
> >> So if this was the only committer (e.g., the transaction_kthread),
> >> then transaction commit will not happen at all. Is this what you
> >> intended? So then the user will notice that FS went read-only, and she
> >> will unmount the FS, and transaction will be cleaned up in
> >> btrfs_error_commit_super()=>btrfs_cleanup_transaction(), and not in
> >> cleanup_transaction() via btrfs_commit_transaction(). Is my
> >> understanding correct?
> >>
> >> Liu, it looks like after having Josef's fix, the above two commits of
> >> yours are not strictly needed, right? Because Josef's fix ensures that
> >> only the "real" committer will call cleanup_transaction(), so at this
> >> point there is only one writer attached to the transaction, which is
> >> the committer itself (your fixes doesn't hurt though). Is that
> >> correct?
> >>
> >
> > I've looked at the patches and I'm going to say yes with the caveat that
> > I
> > stopped thinking about it when my head started hurting :).  Thanks,
> >
> > Josef
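
For reference, "Btrfs: don't re-enter when allocating a chunk" breaks the
recursion shown in the trace above roughly like this (a sketch of the
commit's idea, not the verbatim diff):

    /* in struct btrfs_trans_handle: */
    bool allocating_chunk;

    /* in do_chunk_alloc(): */
    if (trans->allocating_chunk)
            return -ENOSPC; /* the dev-extent insert recursed into us;
                               pretend the chunk allocation failed */
    trans->allocating_chunk = true;
    ret = btrfs_alloc_chunk(trans, extent_root, flags);
    trans->allocating_chunk = false;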


Re: question about transaction-abort-related commits

2013-06-24 Thread Alex Lyakas
Thanks for commenting, Josef. I hope your head will get better :)
Actually, while re-looking at the code, I see that there are a couple of
"goto cleanup;" statements before we ensure that all the writers have
detached from the committing transaction. So Liu's code is still needed,
it looks like.

Thanks,
Alex.

On Mon, Jun 24, 2013 at 7:24 PM, Josef Bacik  wrote:
> On Sun, Jun 23, 2013 at 09:52:14PM +0300, Alex Lyakas wrote:
>> Hello Josef, Liu,
>> I am reviewing commits in the mainline tree:
>>
>> e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't
>> committing just end the transaction if we error out
>> (call end_transaction() instead of goto cleanup_transaction) - Josef
>>
>> f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after
>> aborting a transaction
>> (wait until all writers detach, before setting running_transaction to
>> NULL) - Liu
>>
>> 66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on
>> transaction waiting list
>> (check if transaction was already removed from the transactions list) - Liu
>>
>> Josef, according to your fix, if the committer encounters a problem
>> early, it just doesn't commit. Instead it aborts the transaction
>> (possibly setting FS to read-only) and detaches from the transaction.
>> So if this was the only committer (e.g., the transaction_kthread),
>> then transaction commit will not happen at all. Is this what you
>> intended? So then the user will notice that FS went read-only, and she
>> will unmount the FS, and transaction will be cleaned up in
>> btrfs_error_commit_super()=>btrfs_cleanup_transaction(), and not in
>> cleanup_transaction() via btrfs_commit_transaction(). Is my
>> understanding correct?
>>
>> Liu, it looks like after having Josef's fix, the above two commits of
>> yours are not strictly needed, right? Because Josef's fix ensures that
>> only the "real" committer will call cleanup_transaction(), so at this
>> point there is only one writer attached to the transaction, which is
>> the committer itself (your fixes doesn't hurt though). Is that
>> correct?
>>
>
> I've looked at the patches and I'm going to say yes with the caveat that I
> stopped thinking about it when my head started hurting :).  Thanks,
>
> Josef


Re: [PATCH] Btrfs: make delayed ref lock logic more readable

2013-06-24 Thread Alex Lyakas
Hi Miao,
I believe the name of this patch is misleading. The significant
contribution of this patch IMO is the btrfs_release_ref_cluster(),
which is being called when btrfs_run_delayed_refs() aborts the
transaction. Without this fix, btrfs_destroy_delayed_refs() was
crashing when it was doing:
list_del_init(&head->cluster);

Because the head of the list was a local variable in some stack frame
that is long gone...

So I had weird kernel crashes like:
kernel: [  385.668132] kernel tried to execute NX-protected page -
exploit attempt? (uid: 0)
kernel: [  385.669583] BUG: unable to handle kernel paging request at
8800487abd68

Thanks anyway for fixing this.

Alex.
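
To spell out the crash mechanism: the cluster list head lives on the stack
of whoever called btrfs_find_ref_cluster(), so any delayed-ref head still
linked into it points at a dead stack frame once that function returns.
An illustrative sketch (not actual btrfs code):

    static void run_refs(struct btrfs_trans_handle *trans, u64 start)
    {
            LIST_HEAD(cluster);     /* on-stack list head */

            btrfs_find_ref_cluster(trans, &cluster, start);
            /* the transaction aborts; without btrfs_release_ref_cluster()
             * the ref heads stay linked to &cluster ... */
    }       /* ... which becomes dead stack memory on return */

    /* later, btrfs_destroy_delayed_refs() walks the heads and does: */
    list_del_init(&head->cluster);  /* prev/next point into the dead
                                       frame -> the crashes above */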



On Wed, Dec 19, 2012 at 10:10 AM, Miao Xie  wrote:
> Locking and unlocking delayed ref mutex are in the different functions,
> and the name of lock functions is not uniform, so the readability is not
> so good, this patch optimizes the lock logic and makes it more readable.
>
> Signed-off-by: Miao Xie 
> ---
>  fs/btrfs/delayed-ref.c |  8 
>  fs/btrfs/delayed-ref.h |  6 ++
>  fs/btrfs/extent-tree.c | 42 --
>  3 files changed, 38 insertions(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 455894f..b7a0641 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -426,6 +426,14 @@ again:
> return 1;
>  }
>
> +void btrfs_release_ref_cluster(struct list_head *cluster)
> +{
> +   struct list_head *pos, *q;
> +
> +   list_for_each_safe(pos, q, cluster)
> +   list_del_init(pos);
> +}
> +
>  /*
>   * helper function to update an extent delayed ref in the
>   * rbtree.  existing and update must both have the same
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index fe50392..7939149 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -211,8 +211,14 @@ struct btrfs_delayed_ref_head *
>  btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr);
>  int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
>struct btrfs_delayed_ref_head *head);
> +static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head 
> *head)
> +{
> +   mutex_unlock(&head->mutex);
> +}
> +
>  int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
>struct list_head *cluster, u64 search_start);
> +void btrfs_release_ref_cluster(struct list_head *cluster);
>
>  int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
> struct btrfs_delayed_ref_root *delayed_refs,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ae3c24a..b6ed965 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2143,7 +2143,6 @@ static int run_one_delayed_ref(struct 
> btrfs_trans_handle *trans,
>   node->num_bytes);
> }
> }
> -   mutex_unlock(&head->mutex);
> return ret;
> }
>
> @@ -2258,7 +2257,7 @@ static noinline int run_clustered_refs(struct 
> btrfs_trans_handle *trans,
>  * process of being added. Don't run this ref yet.
>  */
> list_del_init(&locked_ref->cluster);
> -   mutex_unlock(&locked_ref->mutex);
> +   btrfs_delayed_ref_unlock(locked_ref);
> locked_ref = NULL;
> delayed_refs->num_heads_ready++;
> spin_unlock(&delayed_refs->lock);
> @@ -2297,25 +2296,22 @@ static noinline int run_clustered_refs(struct 
> btrfs_trans_handle *trans,
> btrfs_free_delayed_extent_op(extent_op);
>
> if (ret) {
> -   list_del_init(&locked_ref->cluster);
> -   mutex_unlock(&locked_ref->mutex);
> -
> -   printk(KERN_DEBUG "btrfs: 
> run_delayed_extent_op returned %d\n", ret);
> +   printk(KERN_DEBUG
> +  "btrfs: run_delayed_extent_op "
> +  "returned %d\n", ret);
> spin_lock(&delayed_refs->lock);
> +   btrfs_delayed_ref_unlock(locked_ref);
> return ret;
> }
>
> goto next;
> }
> -
> -   list_del_init(&locked_ref->cluster);
> -   locked_ref = NULL;
> }
>
> ref->in_tree = 0;
> rb_erase(&ref->rb_node, &delayed_refs->root);
> delayed_refs->num_entries--;
> -   if (locked_ref)

question about transaction-abort-related commits

2013-06-23 Thread Alex Lyakas
Hello Josef, Liu,
I am reviewing commits in the mainline tree:

e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't
committing just end the transaction if we error out
(call end_transaction() instead of goto cleanup_transaction) - Josef

f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after
aborting a transaction
(wait until all writers detach, before setting running_transaction to
NULL) - Liu

66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on
transaction waiting list
(check if transaction was already removed from the transactions list) - Liu

Josef, according to your fix, if the committer encounters a problem
early, it just doesn't commit. Instead it aborts the transaction
(possibly setting FS to read-only) and detaches from the transaction.
So if this was the only committer (e.g., the transaction_kthread),
then transaction commit will not happen at all. Is this what you
intended? So then the user will notice that FS went read-only, and she
will unmount the FS, and transaction will be cleaned up in
btrfs_error_commit_super()=>btrfs_cleanup_transaction(), and not in
cleanup_transaction() via btrfs_commit_transaction(). Is my
understanding correct?

Liu, it looks like after having Josef's fix, the above two commits of
yours are not strictly needed, right? Because Josef's fix ensures that
only the "real" committer will call cleanup_transaction(), so at this
point there is only one writer attached to the transaction, which is
the committer itself (your fixes don't hurt, though). Is that
correct?

Thanks for helping,
Alex.
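
For reference, the shape of Josef's change in btrfs_commit_transaction()
is roughly the following (a sketch, not the verbatim commit):

    ret = btrfs_run_delayed_refs(trans, root, 0);
    if (ret) {
            /* abort and detach instead of falling into the commit
             * cleanup path while other writers are still attached */
            btrfs_abort_transaction(trans, root, ret);
            btrfs_end_transaction(trans, root);
            return ret;
    }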


Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-06-16 Thread Alex Lyakas
Hi Miao,

On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie  wrote:
> On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
>> I reviewed the code starting from:
>> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
>> the transaction commit
>> until
>> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
>>
>> It looks very good. Let me check if I understand the fix correctly:
>> # When transaction starts to commit, we want to wait only for external
>> writers (those that did ATTACH/START/USERSPACE)
>> # We guarantee at this point that no new external writers will hop on
>> the committing transaction, by setting ->blocked state, so we only
>> wait for existing extwriters to detach from transaction

I have a doubt about this point with your new code. Example:
Task1 - external writer
Task2 - transaction kthread

Task1                                               Task2
|start_transaction(TRANS_START)                     |
|-wait_current_trans(blocked=0, so it doesn't wait) |
|-join_transaction()                                |
|--lock(trans_lock)                                 |
|--can_join_transaction() YES                       |
|                                                   |-btrfs_commit_transaction()
|                                                   |--blocked=1
|                                                   |--in_commit=1
|                                                   |--wait_event(extwriter == 0);
|                                                   |--btrfs_flush_all_pending_stuffs()
|--extwriter_counter_inc()                          |
|--unlock(trans_lock)                               |
|                                                   | lock(trans_lock)
|                                                   | trans_no_join=1

Basically, the "blocked/in_commit" check is not synchronized with
joining a transaction. After checking "blocked", the external writer
may proceed and join the transaction. Right before joining, it calls
can_join_transaction(), which checks the in_commit flag under
fs_info->trans_lock. However, btrfs_commit_transaction() sets this flag
under commit_lock, not under trans_lock, so checking this flag is not
synchronized.

Or maybe I am wrong, because btrfs_commit_transaction() locks and
unlocks trans_lock to check for previous transaction, so by accident
there is no problem, and above scenario cannot happen?


>> # We do not care at this point for TRANS_JOIN etc, we let them hop on
>> if they want
>> # When all external writers have detached, we flush their delalloc and
>> then we prevent all the others to join (TRANS_JOIN etc)
>>
>> # Previously, we had the do-while loop, that intended to do the same,
>> but it used num_writers, which counts both external writers and also
>> TRANS_JOIN. So the loop was racy because new joins prevented it from
>> completing.
>>
>> Is my understanding correct?
>
> Yes, you are right.
>
>> I have some questions:
>> # Why was the do-while loop needed? Can we just delete the do-while
>> loop as it was before, call flush_all_pending stuffs(),  then set
>> trans_no_join and wait for all writers to detach? Is there some
>> correctness problem here?
>> Or we need to wait for external writers to detach before calling
>> flush_all_pending_stuffs() one last time?
>
> The external writers will introduce pending works, we need flush them
> after they detach, otherwise we will forget to deal with them at the current
> transaction just like the following case:
>
> Task1   Task2
> start_transaction
> commit_transaction
>   flush_all_pending_stuffs
> add pending works
> end_transaction
>   ...
>
>
>> # Why TRANS_ATTACH is considered external writer?
>
> - at most cases, it is done by the users' operations.
> - if in_commit is set, we shouldn't start it, or the deadlock will happen.
>   it is the same as TRANS_START/TRANS_USERSPACE.
>
>> # Can I apply this fix to 3.8.x kernel (manually, of course)? Or some
>> additional things are needed that are missing in this kernel?
>
> Yes, you can rebase it against 3.8.x kernel freely.
>
> Thanks
> Miao

Thanks,
Alex.
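
For reference, the ordering that makes this safe in the extwriters
patchset looks roughly like this in btrfs_commit_transaction() (a sketch):

    ret = btrfs_flush_all_pending_stuffs(trans, root);
    if (ret)
            goto cleanup_transaction;

    /* no new external writer can attach once ->blocked is set,
     * so this wait is bounded */
    wait_event(cur_trans->writer_wait,
               extwriter_counter_read(cur_trans) == 0);

    /* flush again: the last external writers may have queued new
     * pending work just before detaching */
    ret = btrfs_flush_all_pending_stuffs(trans, root);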


Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-06-12 Thread Alex Lyakas
Hi Miao,

On Thu, May 9, 2013 at 10:57 AM, Miao Xie  wrote:
> Hi, Alex
>
> Could you try the following patchset?
>
>   git://github.com/miaoxie/linux-btrfs.git trans-commit-improve
>
> I think it can avoid the problem you said below.
>
> Note: this patchset is against chris's for-linus branch.

I reviewed the code starting from:
69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
the transaction commit
until
2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()

It looks very good. Let me check if I understand the fix correctly:
# When transaction starts to commit, we want to wait only for external
writers (those that did ATTACH/START/USERSPACE)
# We guarantee at this point that no new external writers will hop on
the committing transaction, by setting ->blocked state, so we only
wait for existing extwriters to detach from transaction
# We do not care at this point for TRANS_JOIN etc, we let them hop on
if they want
# When all external writers have detached, we flush their delalloc and
then we prevent all the others to join (TRANS_JOIN etc)

# Previously, we had the do-while loop, that intended to do the same,
but it used num_writers, which counts both external writers and also
TRANS_JOIN. So the loop was racy because new joins prevented it from
completing.

Is my understanding correct?

I have some questions:
# Why was the do-while loop needed? Can we just delete the do-while
loop as it was before, call flush_all_pending_stuffs(), then set
trans_no_join and wait for all writers to detach? Is there some
correctness problem here?
Or do we need to wait for the external writers to detach before calling
flush_all_pending_stuffs() one last time?

# Why is TRANS_ATTACH considered an external writer?

# Can I apply this fix to a 3.8.x kernel (manually, of course)? Or are
some additional things needed that are missing in this kernel?

Thanks,
Alex.






>
> Thanks
> Miao
>
> On Wed, 10 Apr 2013 21:45:43 +0300, Alex Lyakas wrote:
>> Hi Miao,
>> I attempted to fix the issue by not joining a transaction that has
>> trans->in_commit set. I did something similar to what
>> wait_current_trans() does, but I did:
>>
>> smp_rmb();
>> if (cur_trans && cur_trans->in_commit) {
>> ...
>> wait_event(root->fs_info->transaction_wait,  !cur_trans->blocked);
>> ...
>>
>> I also had to change the order of setting in_commit and blocked in
>> btrfs_commit_transaction:
>>   trans->transaction->blocked = 1;
>>   trans->transaction->in_commit = 1;
>>   smp_wmb();
>> to make sure that if in_commit is set, then blocked cannot be 0,
>> because btrfs_commit_transaction haven't set it yet to 1.
>>
>> However, with this fix I observe two issues:
>> # With large trees and heavy commits, join_transaction() is delayed
>> sometimes by 1-3 seconds. This delays the host IO by too much.
>> # With this fix, I think too many transactions happen. Basically with
>> this fix, once transaction->in_commit is set, then I insist to open a
>> new transaction and not to join the current one. It has some bad
>> influence on host response times pattern, but I cannot exactly tell
>> why is that.
>>
>> Did you have other fix in mind?
>>
>> Without the fix, I observe sometimes commits that take like 80
>> seconds, out of which like 50 seconds are spent in the do-while loop
>> of btrfs_commit_transaction.
>>
>> Thanks,
>> Alex.
>>
>>
>>
>> On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
>>  wrote:
>>> Hi Miao,
>>>
>>> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie  wrote:
>>>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>>>>> Hi Miao,
>>>>> I am seeing another issue. Your fix prevents from TRANS_START to get
>>>>> in the way of a committing transaction. But it does not prevent from
>>>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>>>>> following loop:
>>>>>
>>>>> do {
>>>>> // attempt to do some useful stuff and/or sleep
>>>>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>>>>(should_grow && cur_trans->num_joined != joined));
>>>>>
>>>>> What I see is basically that new writers join the transaction, while
>>>>> btrfs_commit_transaction() does this loop. I see
>>>>> cur_trans->num_writers decreasing, but then it increases, then
>>>>> decreases etc. This can go for several seconds during heavy IO load.
>>>>> There is nothing to prev
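
For reference, mainline later resolved this synchronization question by
folding the separate in_commit/blocked flags into a single per-transaction
state that is only changed under trans_lock; roughly (a sketch of that
later rework):

    /* in btrfs_commit_transaction(): */
    spin_lock(&root->fs_info->trans_lock);
    cur_trans->state = TRANS_STATE_COMMIT_START;
    spin_unlock(&root->fs_info->trans_lock);

    /* joiners read cur_trans->state under the same trans_lock, so the
     * "may I join?" check and the join itself are serialized */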

wait_block_group_cache_progress() waits forever in case of drive failure

2013-06-04 Thread Alex Lyakas
Greetings all,
when testing drive failures, I occasionally hit the following hang:

# Block group is being cached-in by caching_thread()
# caching_thread() experiences an error, e.g., in btrfs_search_slot,
because of a drive failure:
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
if (ret < 0)
goto err;

# caching thread exits:
err:
btrfs_free_path(path);
up_read(&fs_info->extent_commit_sem);

free_excluded_extents(extent_root, block_group);

mutex_unlock(&caching_ctl->mutex);
out:
wake_up(&caching_ctl->wait);

put_caching_control(caching_ctl);
btrfs_put_block_group(block_group);

However, wait_block_group_cache_progress() is still stuck in a stack like this:
[] schedule+0x29/0x70
[] wait_block_group_cache_progress+0xe2/0x110 [btrfs]
[] ? add_wait_queue+0x60/0x60
[] ? add_wait_queue+0x60/0x60
[] find_free_extent+0x306/0xb90 [btrfs]
[] ? btrfs_search_slot+0x2fe/0x820 [btrfs]
[] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
...
because of:
wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
   (cache->free_space_ctl->free_space >= num_bytes));

But cache->cached never becomes BTRFS_CACHE_FINISHED, and
cache->free_space_ctl->free_space will also not grow enough, so the
wait never finishes.
At this point, the system totally hangs.

Same problem can happen with wait_block_group_cache_done().

I am thinking: can we add an additional condition, like:
wait_event(caching_ctl->wait,
   test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
   block_group_cache_done(cache) ||
   (cache->free_space_ctl->free_space >= num_bytes));

So that when the transaction aborts, the FS is marked as "bad", and then
all these waits will complete, so that the user can unmount?

Or some other way to fix this problem?

Thanks,
Alex.

P.S: should I open a bugzilla for this?
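
For reference, later kernels address exactly this by marking the block
group's caching state as failed, so that block_group_cache_done() becomes
true and the wake_up() already in the error path releases the waiters;
roughly (a sketch using the mainline BTRFS_CACHE_ERROR state):

    /* in caching_thread()'s error path: */
    spin_lock(&block_group->lock);
    block_group->cached = BTRFS_CACHE_ERROR;
    spin_unlock(&block_group->lock);
    ...
    wake_up(&caching_ctl->wait);    /* waiters re-check
                                       block_group_cache_done() */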


[no subject]

2013-05-28 Thread Alex Lyakas
Hello all,
I have the following unresponsive btrfs:

btrfs_end_transaction() is called and is stuck in btrfs_tree_lock():

May 27 16:13:55 vc kernel: [ 7130.421159] kworker/u:85D
 0 19859  2 0x
May 27 16:13:55 vc kernel: [ 7130.421159]  880095335568
0046 00010093cb38 880083b11b48
May 27 16:13:55 vc kernel: [ 7130.421159]  880095335fd8
880095335fd8 880095335fd8 00013f40
May 27 16:13:55 vc kernel: [ 7130.421159]  8800a1fddd00
88008b1fc5c0 880095335578 880090f736d8
May 27 16:13:55 vc kernel: [ 7130.421159] Call Trace:
May 27 16:13:55 vc kernel: [ 7130.421159]  []
schedule+0x29/0x70
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_tree_lock+0xcd/0x250 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
add_wait_queue+0x60/0x60
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_init_new_buffer+0x68/0x140 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_alloc_free_block+0xdd/0x460 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
__set_page_dirty_nobuffers+0x1b/0x20
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
btree_set_page_dirty+0xe/0x10 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
__btrfs_cow_block+0x126/0x4f0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_cow_block+0x123/0x1d0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_search_slot+0x381/0x820 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
lookup_inline_extent_backref+0x8e/0x5b0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
btrfs_mark_buffer_dirty+0x99/0xf0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
setup_inline_extent_backref+0x18e/0x290 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
insert_inline_extent_backref+0x63/0x130 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
btrfs_alloc_path+0x1a/0x20 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
__btrfs_inc_extent_ref+0x9f/0x240 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
btrfs_merge_delayed_refs+0x289/0x300 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
run_clustered_refs+0x971/0xd00 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
btrfs_put_tree_mod_seq+0x10d/0x150 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_run_delayed_refs+0xd0/0x320 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
__btrfs_end_transaction+0xf7/0x410 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_end_transaction+0x10/0x20 [btrfs]

As a result, the transaction cannot commit; it waits for all writers to
detach in the do-while loop.

May 27 16:13:55 vc kernel: [ 7130.419009] btrfs-transacti D
 0 15150  2 0x
May 27 16:13:55 vc kernel: [ 7130.419012]  88009f86bce8
0046 032d032d 
May 27 16:13:55 vc kernel: [ 7130.419016]  88009f86bfd8
88009f86bfd8 88009f86bfd8 00013f40
May 27 16:13:55 vc kernel: [ 7130.419020]  8800af1e9740
8800a03f8000 0090 88009693cb00
May 27 16:13:55 vc kernel: [ 7130.419023] Call Trace:
May 27 16:13:55 vc kernel: [ 7130.419027]  []
schedule+0x29/0x70
May 27 16:13:55 vc kernel: [ 7130.419031]  []
schedule_timeout+0x1ed/0x250
May 27 16:13:55 vc kernel: [ 7130.419055]  [] ?
btrfs_run_ordered_operations+0x2b3/0x2e0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419060]  [] ?
default_spin_lock_flags+0x9/0x10
May 27 16:13:55 vc kernel: [ 7130.419081]  []
btrfs_commit_transaction+0x3b8/0xae0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419085]  [] ?
add_wait_queue+0x60/0x60
May 27 16:13:55 vc kernel: [ 7130.419104]  []
transaction_kthread+0x1b5/0x230 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419124]  [] ?
btree_invalidatepage+0x80/0x80 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419128]  []
kthread+0xc0/0xd0
May 27 16:13:55 vc kernel: [ 7130.419132]  [] ?
flush_kthread_worker+0xb0/0xb0
May 27 16:13:55 vc kernel: [ 7130.419136]  []
ret_from_fork+0x7c/0xb0
May 27 16:13:55 vc kernel: [ 7130.419140]  [] ?
flush_kthread_worker+0xb0/0xb0

There is an additional thread stuck in btrfs_tree_lock(); I am not sure
how it is related. Perhaps there is some deadlock between the two?

May 27 16:13:55 vc kernel: [ 7130.421159] flush-btrfs-2   D
0001 0 18816  2 0x
May 27 16:13:55 vc kernel: [ 7130.421159]  88008b553948
0046 880017991050 
May 27 16:13:55 vc kernel: [ 7130.421159]  88008b553fd8
88008b553fd8 88008b553fd8 00013f40
May 27 16:13:55 vc kernel: [ 7130.421159]  880119b11740
8800af86 88008b553958 880090c9d988
May 27 16:13:55 vc kernel: [ 7130.421159] Call Trace:
May 27 16:13:55 vc kernel: [ 7130.421159]  []
schedule+0x29/0x70
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btrfs_tree_lock+0xcd/0x250 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [] ?
add_wait_queue+0x60/0x60
May 27 16:13:55 vc kernel: [ 7130.421159]  []
btree_write_cache_pages+0x3bc/0x880 [btrfs]
May 27 16:13:55

Re: [PATCH] Btrfs: clear received_uuid field for new writable snapshots

2013-05-22 Thread Alex Lyakas
Hi Stefan,
I fully understand the first part of your fix, and I believe it's
quite critical. Indeed, a writable snapshot should have no evidence
that it has an ancestor that was once "received".

Can you please confirm that I understand the second part of your fix
correctly? In btrfs-progs, the following code in tree_search() would have
prevented us from mistakenly selecting such a snapshot as a parent for
"receive":
if (type == subvol_search_by_received_uuid) {
entry = rb_entry(n, struct subvol_info,
rb_received_node);
comp = memcmp(entry->received_uuid, uuid,
BTRFS_UUID_SIZE);
if (!comp) {
if (entry->stransid < stransid)
comp = -1;
else if (entry->stransid > stransid)
comp = 1;
else
comp = 0;
}
The code checks not only the received_uuid (which would have wrongly
matched what we need), but also the stransid (which was the ctransid
on the send side); the latter would have been zero, so it wouldn't match.

Now, after your fix, does the stransid field become unnecessary?
Because if we have a valid received_uuid, this means that either we
are the "received" snapshot, or our whole chain of ancestors is
read-only, and eventually there was an ancestor that was "received".
So we have valid data and can be used as a parent. Is checking the
stransid field still needed after your fix? (It doesn't hurt to check
it.)

Clearing or not clearing the rtransid: does it bring any value?
rtransid is the local transid of when we completed the "receive"
process for this snapshot. Is there any interesting use for this value?

Thanks,
Alex.


On Wed, Apr 17, 2013 at 12:11 PM, Stefan Behrens
 wrote:
>
> For created snapshots, the full root_item is copied from the source
> root and afterwards selectively modified. The current code forgets
> to clear the field received_uuid. The only problem is that it is
> confusing when you look at it with 'btrfs subv list', since for
> writable snapshots, the contents of the snapshot can be completely
> unrelated to the previously received snapshot.
> The receiver ignores such snapshots anyway because he also checks
> the field stransid in the root_item and that value used to be reset
> to zero for all created snapshots.
>
> This commit changes two things:
> - clear the received_uuid field for new writable snapshots.
> - don't clear the send/receive related information like the stransid
>   for read-only snapshots (which makes them useable as a parent for
>   the automatic selection of parents in the receive code).
>
> Signed-off-by: Stefan Behrens 
> ---
>  fs/btrfs/transaction.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index ffac232..94cbd10 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1170,13 +1170,17 @@ static noinline int create_pending_snapshot(struct 
> btrfs_trans_handle *trans,
> memcpy(new_root_item->uuid, new_uuid.b, BTRFS_UUID_SIZE);
> memcpy(new_root_item->parent_uuid, root->root_item.uuid,
> BTRFS_UUID_SIZE);
> +   if (!(root_flags & BTRFS_ROOT_SUBVOL_RDONLY)) {
> +   memset(new_root_item->received_uuid, 0,
> +  sizeof(new_root_item->received_uuid));
> +   memset(&new_root_item->stime, 0, 
> sizeof(new_root_item->stime));
> +   memset(&new_root_item->rtime, 0, 
> sizeof(new_root_item->rtime));
> +   btrfs_set_root_stransid(new_root_item, 0);
> +   btrfs_set_root_rtransid(new_root_item, 0);
> +   }
> new_root_item->otime.sec = cpu_to_le64(cur_time.tv_sec);
> new_root_item->otime.nsec = cpu_to_le32(cur_time.tv_nsec);
> btrfs_set_root_otransid(new_root_item, trans->transid);
> -   memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
> -   memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
> -   btrfs_set_root_stransid(new_root_item, 0);
> -   btrfs_set_root_rtransid(new_root_item, 0);
>
> old = btrfs_lock_root_node(root);
> ret = btrfs_cow_block(trans, root, old, NULL, 0, &old);
> --
> 1.8.2.1
>


Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-04-10 Thread Alex Lyakas
Hi Miao,
I attempted to fix the issue by not joining a transaction that has
trans->in_commit set. I did something similar to what
wait_current_trans() does, but I did:

smp_rmb();
if (cur_trans && cur_trans->in_commit) {
...
wait_event(root->fs_info->transaction_wait,  !cur_trans->blocked);
...

I also had to change the order of setting in_commit and blocked in
btrfs_commit_transaction:
trans->transaction->blocked = 1;
trans->transaction->in_commit = 1;
smp_wmb();
to make sure that if in_commit is set, then blocked cannot be 0,
because btrfs_commit_transaction hasn't set it to 1 yet.

However, with this fix I observe two issues:
# With large trees and heavy commits, join_transaction() is sometimes
delayed by 1-3 seconds. This delays the host IO too much.
# With this fix, I think too many transactions happen. Basically, once
transaction->in_commit is set, I insist on opening a new transaction
rather than joining the current one. This has a bad influence on the
host response-time pattern, but I cannot tell exactly why.

Did you have another fix in mind?

Without the fix, I sometimes observe commits that take around 80
seconds, out of which around 50 seconds are spent in the do-while loop
of btrfs_commit_transaction.

Thanks,
Alex.



On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
 wrote:
> Hi Miao,
>
> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie  wrote:
>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>>> Hi Miao,
>>> I am seeing another issue. Your fix prevents from TRANS_START to get
>>> in the way of a committing transaction. But it does not prevent from
>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>>> following loop:
>>>
>>> do {
>>> // attempt to do some useful stuff and/or sleep
>>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>>(should_grow && cur_trans->num_joined != joined));
>>>
>>> What I see is basically that new writers join the transaction, while
>>> btrfs_commit_transaction() does this loop. I see
>>> cur_trans->num_writers decreasing, but then it increases, then
>>> decreases etc. This can go for several seconds during heavy IO load.
>>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>>> a transaction over and over, thus delaying transaction commit. The IO
>>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>>
>>> Do you observe such behavior? Do you believe it's problematic?
>>
>> I know this behavior, there is no problem with it, the latter code
>> will prevent from TRANS_JOIN.
>>
>> 1672 spin_lock(&root->fs_info->trans_lock);
>> 1673 root->fs_info->trans_no_join = 1;
>> 1674 spin_unlock(&root->fs_info->trans_lock);
>> 1675 wait_event(cur_trans->writer_wait,
>> 1676atomic_read(&cur_trans->num_writers) == 1);
>>
> Yes, this code prevents anybody from joining, but before
> btrfs_commit_transaction() gets to this code, it may spend sometimes
> 10 seconds (in my tests) in the do-while loop, while new writers come
> and go. Basically, it is not deterministic when the do-while loop will
> exit, it depends on the IO pattern.
>
>> And if we block the TRANS_JOIN at the place you point out, the deadlock
>> will happen because we need deal with the ordered operations which will
>> use TRANS_JOIN here.
>>
>> (I am dealing with the problem you said above by adding a new type of
>> TRANS_* now)
>
> Thanks.
> Alex.
>
>
>>
>> Thanks
>> Miao
>>
>>> Thanks,
>>> Alex.
>>>
>>>
>>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie  wrote:
>>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>>> Hi Miao,
>>>>> can you please explain your solution a bit more.
>>>>>
>>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie  wrote:
>>>>>> Now btrfs_commit_transaction() does this
>>>>>>
>>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>>
>>>>>> which async flushes all inodes on the ordered operations list, it 
>>>>>> introduced
>>>>>> a deadlock that transaction-start task, transaction-commit task and the 
>>>>>> flush
>>>>>> workers waited for each other.
>>>>>> (See the following URL to get the detail
>>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2

Re: Backup Options

2013-04-09 Thread Alex Lyakas
Hi David,
maybe my old patch
http://www.spinics.net/lists/linux-btrfs/msg19739.html
can help with this issue?

Thanks,
Alex.


On Wed, Apr 3, 2013 at 8:23 PM, David Sterba  wrote:
> On Wed, Apr 03, 2013 at 04:33:22AM +0200, Harald Glatt wrote:
>> However what I actually did was:
>> # cd /mnt/restore
>> # nc -l -p  | btrfs receive .
>>
>> After noticing this difference I had to try it again as described in
>> my mail and - oh wonder - it works now!! Giving 'btrfs receive' a dot
>> as a parameter seems to fail in this case. Is this expected behavior
>> or a bug?
>
> Bug. Relative paths do not work on the receive side.
>
> david


Re: [PATCH v2] Btrfs: fix locking on ROOT_REPLACE operations in tree mod log

2013-04-02 Thread Alex Lyakas
Hi Jan,
I have manually applied this patch and also your previous patch onto
kernel 3.8.2, but unfortunately I am still hitting the issue :(
I will check further whether I can be more helpful in debugging this
issue than just reporting it :(

Thanks for your help,
Alex.



On Wed, Mar 20, 2013 at 3:49 PM, Jan Schmidt  wrote:
> To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
> required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
> Therefore, those operations must be enclosed by tree_mod_log_write_lock()
> and tree_mod_log_write_unlock() calls.
>
> Those calls are private to the tree_mod_log_* functions, which means that
> removal of the elements of an old root node must be logged from
> tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5
> (Btrfs: fix a tree mod logging issue for root replacement operations).
>
> This fixes the brand-new version of xfstest 276 as of commit cfe73f71.
>
> Signed-off-by: Jan Schmidt 
> ---
> Has probably been Reported-by: Alex Lyakas 
> (unconfirmed).
>
> Chages for v2:
> - use the correct base (current cmason/for-linus)
>
>  fs/btrfs/ctree.c |   30 --
>  1 files changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index ecd25a1..ca9d8f1 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -651,6 +651,8 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
> if (tree_mod_dont_log(fs_info, NULL))
> return 0;
>
> +   __tree_mod_log_free_eb(fs_info, old_root);
> +
> ret = tree_mod_alloc(fs_info, flags, &tm);
> if (ret < 0)
> goto out;
> @@ -736,7 +738,7 @@ tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 
> start, u64 min_seq)
>  static noinline void
>  tree_mod_log_eb_copy(struct btrfs_fs_info *fs_info, struct extent_buffer 
> *dst,
>  struct extent_buffer *src, unsigned long dst_offset,
> -unsigned long src_offset, int nr_items)
> +unsigned long src_offset, int nr_items, int log_removal)
>  {
> int ret;
> int i;
> @@ -750,10 +752,12 @@ tree_mod_log_eb_copy(struct btrfs_fs_info *fs_info, 
> struct extent_buffer *dst,
> }
>
> for (i = 0; i < nr_items; i++) {
> -   ret = tree_mod_log_insert_key_locked(fs_info, src,
> -i + src_offset,
> -MOD_LOG_KEY_REMOVE);
> -   BUG_ON(ret < 0);
> +   if (log_removal) {
> +   ret = tree_mod_log_insert_key_locked(fs_info, src,
> +   i + src_offset,
> +   MOD_LOG_KEY_REMOVE);
> +   BUG_ON(ret < 0);
> +   }
> ret = tree_mod_log_insert_key_locked(fs_info, dst,
>  i + dst_offset,
>  MOD_LOG_KEY_ADD);
> @@ -927,7 +931,6 @@ static noinline int update_ref_for_cow(struct 
> btrfs_trans_handle *trans,
> ret = btrfs_dec_ref(trans, root, buf, 1, 1);
> BUG_ON(ret); /* -ENOMEM */
> }
> -   tree_mod_log_free_eb(root->fs_info, buf);
> clean_tree_block(trans, root, buf);
> *last_ref = 1;
> }
> @@ -1046,6 +1049,7 @@ static noinline int __btrfs_cow_block(struct 
> btrfs_trans_handle *trans,
> btrfs_set_node_ptr_generation(parent, parent_slot,
>   trans->transid);
> btrfs_mark_buffer_dirty(parent);
> +   tree_mod_log_free_eb(root->fs_info, buf);
> btrfs_free_tree_block(trans, root, buf, parent_start,
>   last_ref);
> }
> @@ -1750,7 +1754,6 @@ static noinline int balance_level(struct 
> btrfs_trans_handle *trans,
> goto enospc;
> }
>
> -   tree_mod_log_free_eb(root->fs_info, root->node);
> tree_mod_log_set_root_pointer(root, child);
> rcu_assign_pointer(root->node, child);
>
> @@ -2995,7 +2998,7 @@ static int push_node_left(struct btrfs_trans_handle 
> *trans,
> push_items = min(src_nritems - 8, push_items);
>
> tree_mod_log_eb_copy(root->fs_info, dst, src, dst_nritems, 0,
> -push_items);
> +push_it

Re: btrfs "stuck" on

2013-04-02 Thread Alex Lyakas
Hi David,

On Fri, Mar 29, 2013 at 8:12 PM, David Sterba  wrote:
> On Thu, Mar 21, 2013 at 11:56:37AM -0700, Ask Bjørn Hansen wrote:
>> A few weeks ago I replaced a ZFS backup system with one backed by
>> btrfs. A script loops over a bunch of hosts rsyncing them to each
>> their own subvolume.  After each rsync I snapshot the "host-specific"
>> subvolume.
>>
>> The "disk" is an iscsi disk that in my benchmarks performs roughly
>> like a local raid with 2-3 SATA disks.
>>
>> It worked fine for about a week (~150 snapshots from ~20 sub volumes)
>> before it "suddenly" exploded in disk io wait. Doing anything (in
>> particular changes) on the file system is just insanely slow, rsync
>> basically can't complete (an rsync that should take 10-20 minutes
>> takes 24 hours; I have a directory of 60k files I tried deleting and
>> it's deleting one file every few minutes, that sort of thing).
>
> I'm seeing similar problem after a test that produces tons of snapshots
> and snap deletions at the same time. Accessing the directory (eg. via
> ls) containing the snapshots blocks for a long time.
>
> The contention point is a mutex of the directory entry, used for lookups
> on the 'ls' side, and the snapshot deletion process holds the mutex as
> well with obvious consequences. The contention is multiplied by the
> number of snapshots waiting to be deleted and eagerly grabbing the
> mutex, making other waiters starve.

Can you please clarify which mutex you mean? Do you mean
dir->i_mutex, taken by btrfs_ioctl_snap_destroy()? If so, this mutex is
held only while adding the snapshot to the to-be-deleted list, and not
during the snapshot deletion itself. Otherwise, I don't see
btrfs_drop_snapshot() locking any mutex, for example.

>
> You've observed this as deletion progressing very slowly and rsync
> blocked. That's really annoying and I'm working towards fixing it.
>
>> I am using 3.8.2-206.fc18.x86_64 (Fedora 18). I tried rebooting, it
>> doesn't make a difference. As soon as I boot "[btrfs-cleaner]" and
>> "[btrfs-transacti]" gets really busy.
>>
>> I wonder if it's because I deleted a few snapshots at some point?
>
> Yes. The progress or performance impact depends on amount of data shared
> among the snapshots and used / free space fragmentation.
>
> david

Thanks,
Alex.


Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-03-25 Thread Alex Lyakas
Hi Miao,

On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie  wrote:
> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> I am seeing another issue. Your fix prevents from TRANS_START to get
>> in the way of a committing transaction. But it does not prevent from
>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>> following loop:
>>
>> do {
>> // attempt to do some useful stuff and/or sleep
>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>(should_grow && cur_trans->num_joined != joined));
>>
>> What I see is basically that new writers join the transaction, while
>> btrfs_commit_transaction() does this loop. I see
>> cur_trans->num_writers decreasing, but then it increases, then
>> decreases etc. This can go for several seconds during heavy IO load.
>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>> a transaction over and over, thus delaying transaction commit. The IO
>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>
>> Do you observe such behavior? Do you believe it's problematic?
>
> I know this behavior, there is no problem with it, the latter code
> will prevent from TRANS_JOIN.
>
> 1672 spin_lock(&root->fs_info->trans_lock);
> 1673 root->fs_info->trans_no_join = 1;
> 1674 spin_unlock(&root->fs_info->trans_lock);
> 1675 wait_event(cur_trans->writer_wait,
> 1676atomic_read(&cur_trans->num_writers) == 1);
>
Yes, this code prevents anybody from joining, but before
btrfs_commit_transaction() gets to this code, it may sometimes spend
10 seconds (in my tests) in the do-while loop while new writers come
and go. Basically, it is not deterministic when the do-while loop will
exit; it depends on the IO pattern.

> And if we block the TRANS_JOIN at the place you point out, the deadlock
> will happen because we need deal with the ordered operations which will
> use TRANS_JOIN here.
>
> (I am dealing with the problem you said above by adding a new type of
> TRANS_* now)

Thanks.
Alex.


>
> Thanks
> Miao
>
>> Thanks,
>> Alex.
>>
>>
>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie  wrote:
>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>> Hi Miao,
>>>> can you please explain your solution a bit more.
>>>>
>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie  wrote:
>>>>> Now btrfs_commit_transaction() does this
>>>>>
>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>
>>>>> which async flushes all inodes on the ordered operations list, it 
>>>>> introduced
>>>>> a deadlock that transaction-start task, transaction-commit task and the 
>>>>> flush
>>>>> workers waited for each other.
>>>>> (See the following URL to get the detail
>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>
>>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>>> current transaction, we should not try to join it if we are not JOIN
>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>>>> the above problem. In this way, there is another benefit: there is no new
>>>>> transaction handle to block the transaction which is on the way of commit,
>>>>> once we set ->in_commit.
>>>>>
>>>>> Signed-off-by: Miao Xie 
>>>>> ---
>>>>>  fs/btrfs/transaction.c |   17 -
>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>> index bc2f2d1..71b7e2e 100644
>>>>> --- a/fs/btrfs/transaction.c
>>>>> +++ b/fs/btrfs/transaction.c
>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct 
>>>>> btrfs_root *root)
>>>>> root->commit_root = btrfs_root_node(root);
>>>>>  }
>>>>>
>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>>> +  int type)
>>>>> +{
>>>>> +   return !(trans->in_commit &&
>>>>> +type != TRANS_JOIN &&
>>>>> +type != TRANS_JOIN_NOLOCK);
>>>>&
