Re: [PATCH] Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.

2015-04-08 Thread Miao Xie
On Thu, 09 Apr 2015 12:08:43 +0800, Dongsheng Yang wrote:
> We need to fill the inode when we find a node for it in delayed_nodes_tree.
> But currently we do not fill ->last_trans, which makes xfstests
> generic/311 fail. The scenario of generic/311 is shown below:
> 
> Problem:
>   (1). test_fd = open(fname, O_RDWR|O_DIRECT)
>   (2). pwrite(test_fd, buf, 4096, 0)
>   (3). close(test_fd)
>   (4). drop_all_caches()  < "echo 3 > /proc/sys/vm/drop_caches"
>   (5). test_fd = open(fname, O_RDWR|O_DIRECT)
>   (6). fsync(test_fd);
>   < we did not get the correct log entry for the file
> Reason:
>   When we re-open this file in (5), we find a node
> in delayed_nodes_tree and fill the inode we are looking up with its
> information. But ->last_trans is not filled, so fsync() checks
> ->last_trans, finds it is 0, and concludes that this inode is already
> in a committed tree, so it does not record the extents for it.
> 
> Fix:
>   This patch fills ->last_trans properly and sets the
> runtime_flags if needed in this situation. Then we get the
> log entries we expect after (6), and generic/311 passes.
> 
> Signed-off-by: Dongsheng Yang 

Good catch!

Reviewed-by: Miao Xie 

> ---
>  fs/btrfs/delayed-inode.c |  2 ++
>  fs/btrfs/inode.c | 21 -
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 82f0c7c..9e8b435 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1801,6 +1801,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
>   set_nlink(inode, btrfs_stack_inode_nlink(inode_item));
>   inode_set_bytes(inode, btrfs_stack_inode_nbytes(inode_item));
>   BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
> +BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
> +
>   inode->i_version = btrfs_stack_inode_sequence(inode_item);
>   inode->i_rdev = 0;
>   *rdev = btrfs_stack_inode_rdev(inode_item);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d2e732d..b132936 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3628,25 +3628,28 @@ static void btrfs_read_locked_inode(struct inode *inode)
>   BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
>   BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
>  
> + inode->i_version = btrfs_inode_sequence(leaf, inode_item);
> + inode->i_generation = BTRFS_I(inode)->generation;
> + inode->i_rdev = 0;
> + rdev = btrfs_inode_rdev(leaf, inode_item);
> +
> + BTRFS_I(inode)->index_cnt = (u64)-1;
> + BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
> +
> +cache_index:
>   /*
>* If we were modified in the current generation and evicted from memory
>* and then re-read we need to do a full sync since we don't have any
>* idea about which extents were modified before we were evicted from
>* cache.
> +  *
> +  * This is required for both inode re-read from disk and delayed inode
> +  * in delayed_nodes_tree.
>*/
>   if (BTRFS_I(inode)->last_trans == root->fs_info->generation)
>   set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>   &BTRFS_I(inode)->runtime_flags);
>  
> - inode->i_version = btrfs_inode_sequence(leaf, inode_item);
> - inode->i_generation = BTRFS_I(inode)->generation;
> - inode->i_rdev = 0;
> - rdev = btrfs_inode_rdev(leaf, inode_item);
> -
> - BTRFS_I(inode)->index_cnt = (u64)-1;
> - BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
> -
> -cache_index:
>   path->slots[0]++;
>   if (inode->i_nlink != 1 ||
>   path->slots[0] >= btrfs_header_nritems(leaf))
> 
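
For reference, here is a minimal user-space sketch of the failing sequence
described above (an illustration of what generic/311 exercises, not the
test itself; the file name is arbitrary and error handling is omitted):

#define _GNU_SOURCE     /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char *buf;
    int fd;

    /* O_DIRECT I/O needs an aligned buffer */
    if (posix_memalign((void **)&buf, 4096, 4096))
        return 1;
    memset(buf, 0xaa, 4096);

    fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644); /* (1) */
    pwrite(fd, buf, 4096, 0);                                 /* (2) */
    close(fd);                                                /* (3) */

    /* (4) evict the inode so it is re-read via the delayed node */
    system("echo 3 > /proc/sys/vm/drop_caches");

    fd = open("testfile", O_RDWR | O_DIRECT);                 /* (5) */
    fsync(fd);                         /* (6) should log the extents */
    close(fd);
    return 0;
}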




Re: [PATCH RFC v6 6/9] vfs: Add sb_want_write() function to get vfsmount from a given sb.

2015-02-04 Thread Miao Xie
On Wed, 04 Feb 2015 10:10:55 +0800, Qu Wenruo wrote:
> *** Please DON'T merge this patch, it's only for discussion purposes ***
> 
> There are sysfs interfaces in some filesystems (only btrfs so far) that
> will modify on-disk data.
> Unlike the normal file operation routines, where we can use
> mnt_want_write_file() to protect the operation, a change through sysfs
> is not bound to any file in the filesystem.
> 
> So introduce a new sb_want_write() to do the protection against a super
> block; it acts much like mnt_want_write() but will return success if
> the super block is read-write.
> 
> Since the sysfs handler doesn't go through a normal vfsmount, it won't
> increase the mount refcount, and even if sb_want_write() is waiting for
> the sb to be unfrozen, the fs can still be unmounted without any problem.
> That leaves the module unable to be removed, and the user can't find out
> what's wrong until 
> 
> There are several strategies to solve this problem.
> 1) Extra check on the last-instance umount of a sb
> This is the method the patch uses.
> This method seems valid enough: since we want write protection on
> a sb, it's OK for the sb as long as there is *ANY* mount instance.
> Problem 1.1)
> lsof and other tools won't help if sb_want_write() on a frozen fs makes
> it impossible to unmount.
> 
> Problem 1.2)
> When namespaces get involved, things become more complicated,
> as in the following case:
>   Alice                                 |   Bob
> Mount devA on /mnt1 in her ns           | Mount devA on /mnt2/ in his ns
> freeze /mnt1                            |
> sb_want_write() (waiting)               |
> umount /mnt1 (succeeds since there is   |
> another mount instance)                 |
>                                         | umount /mnt2 (fails since there
>                                         | is sb_want_write() waiting)
> 
> So Alice can't thaw the fs since there is no mount point for it now.
> 
> 2) Don't allow any umount of the sb while there is an sb_want_write().
> A more aggressive one, proposed by Miao Xie.
> It can't resolve problem 1.1) but will solve problem 1.2).

This is one of the two methods that I told you about, but not the one I
recommended. What I recommended is to thaw the fs at the beginning of the
sb kill process, and, in sb_want_write(), to check whether the sb is still
active after we pass sb_start_write(); if the sb is not active, back out.
(This way is also not so good, but it is better than the one above.)

> Although it introduces a new problem, like the following:
>   Alice
> Mount devA on /mnt1
> freeze /mnt1
> sb_want_write() (waiting)
> mount devA on /mnt2 and /mnt3
> 
> /mnt[123] all can't be unmounted, but new mounts can still be created.
> 
> 3) sb_want_write() doesn't make any sense and breaks VFS rules!
> Actions that change on-disk data should not be tunable through sysfs,
> and an sb_want_write() that bypasses all the VFS checks is just evil.
> And for btrfs, we already have the ioctl to set the label; why bother
> with a new sysfs interface to do it again?
> 
> Although I use method 1) here, I am still not certain which method is
> the correct one.
> 
> So any advise is welcomed.
> 
> Thanks,
> Qu

[SNIP]

> +/**
> + * sb_want_write - get write access to a super block
> + * @sb: the superblock of the filesystem
> + *
> + * This tells the low-level filesystem that a write is about to be performed to
> + * it, and makes sure that the writes are allowed (superblock is read-write,
> + * filesystem is not frozen) before returning success.
> + * When the write operation is finished, sb_drop_write() must be called.
> + * This is much like mnt_want_write() as a refcount, but only needs
> + * the superblock to be read-write.
> + */
> +int sb_want_write(struct super_block *sb)
> +{
> + spin_lock(&sb->s_want_write_lock);
> + if (sb->s_want_write_block) {
> + spin_unlock(&sb->s_want_write_lock);
> + return -EBUSY;
> + }
> + sb->s_want_write_count++;
> + spin_unlock(&sb->s_want_write_lock);
> +
> + sb_start_write(sb);
> + if (sb->s_readonly_remount || sb->s_flags & MS_RDONLY) {

If someone remounts the fs R/O here (after the check), we should not
continue to change the label/features. I think we need to add some checks
in the remount functions.

Thanks
Miao
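
As a rough sketch of that remount-side check (hypothetical code against the
s_want_write_count/s_want_write_block fields this RFC introduces; it is not
part of the posted patch):

/* Refuse to flip the sb read-only while a sysfs writer holds
 * sb_want_write(); otherwise block new sb_want_write() callers
 * for the duration of the remount. */
static int sb_deny_remount_ro(struct super_block *sb)
{
    int ret = 0;

    spin_lock(&sb->s_want_write_lock);
    if (sb->s_want_write_count > 0)
        ret = -EBUSY;   /* a label/feature change is in flight */
    else
        sb->s_want_write_block = 1;
    spin_unlock(&sb->s_want_write_lock);
    return ret;
}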


Re: [PATCH v5 0/9] btrfs: Fix freeze/sysfs deadlock in better method.

2015-01-30 Thread Miao Xie
On Fri, 30 Jan 2015 20:17:49 +0100, David Sterba wrote:
> On Fri, Jan 30, 2015 at 05:20:45PM +0800, Qu Wenruo wrote:
>> [Use VFS protect for sysfs change]
>> The 6th patch will introduce a new helper function, sb_want_write(), to
>> claim write permission on a superblock.
>> With this, we are able to do write protection like mnt_want_write(), but
>> it only needs to ensure that the superblock is writeable.
>> This also keeps the same synchronous behavior as the ioctl, which will
>> block on a frozen fs until it is unfrozen.
> 
> You know what I think about the commit inside sysfs, but it looks better
> to me now with the sb_* protections, so I give it a go.

I am worried about the following case:

# fsfreeze btrfs
# echo "new label" > btrfs_sysfs
It should hang.


On the other terminal:
# umount btrfs


Because the echo command didn't increase the mount reference, umount
does not know someone is still blocked on the fs. It will not back off
and return EBUSY, as it would if someone accessed the fs through the
common fs interfaces; it will deactivate the fs directly and then block
on the sysfs removal.


Thanks
Miao


Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 04:37:14 +, Al Viro wrote:
> On Fri, Jan 30, 2015 at 12:14:24PM +0800, Miao Xie wrote:
>> On Fri, 30 Jan 2015 02:14:45 +, Al Viro wrote:
>>> On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
>>>
>>>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>>>
>>> Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
>>> As for that trylock...  What for?  It invites transient failures for no
>>> good reason.  Removal of sysfs entry will block while write(2) to that 
>>> sucker
>>> is in progress, so btrfs shutdown will block at that point in ctree_close().
>>> It won't go away under you.
>>
>> Could you explain the race condition? I think the deadlock won't happen:
>> during the btrfs shutdown we hold s_umount, so the write operation will
>> fail to lock it and quit quickly, and then umount will continue.
> 
>   First of all, ->s_umount is not a mutex; it's rwsem.  So you mean
> down_read_trylock().  As for the transient failures - grep for down_write
> on it...  E.g. have somebody call mount() from the same device.  We call
> sget(), which finds existing superblock and calls grab_super().  Sure,
> that ->s_umount will be released shortly, but in the meanwhile your trylock
> will fail...

I know, which is why I suggested returning -EBUSY in the previous mail.
I think it is an acceptable method; mount/umount operations are not that
frequent after all.

Thanks
Miao

> 
>> I think sb_want_write() is similar to trylock(s_umount); the difference
>> is that sb_want_write() is more complex.
>>
>>>
>>> Now, you might want to move those sysfs entry removals to the very beginning
>>> of btrfs_kill_super(), but that's a different story - you need only to make
>>> sure that they are removed not later than the destruction of the data
>>> structures they need (IOW, the current location might very well be OK - I
>>> hadn't checked the details).
>>
>> Yes, we need to move those sysfs entry removals, but we needn't move them
>> to the very beginning of btrfs_kill_super(), just to the beginning of
>> close_ctree().
> 
> So move them...  It's a matter of moving one function call around a bit.
> .
> 
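
For illustration, the trylock-plus-EBUSY pattern being defended here would
look roughly like this (a hypothetical sketch with Al's correction applied:
s_umount is an rwsem, so the trylock is down_read_trylock()):

static int btrfs_sysfs_change(struct super_block *sb)
{
    if (!down_read_trylock(&sb->s_umount))
        return -EBUSY;  /* mount/umount in flight; caller may retry */

    if (sb->s_flags & MS_RDONLY) {  /* no on-disk change on a R/O sb */
        up_read(&sb->s_umount);
        return -EROFS;
    }

    /* ... start transaction, change label/feature, commit ... */

    up_read(&sb->s_umount);
    return 0;
}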



Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 02:14:45 +, Al Viro wrote:
> On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
> 
>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
> 
> Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
> As for that trylock...  What for?  It invites transient failures for no
> good reason.  Removal of sysfs entry will block while write(2) to that sucker
> is in progress, so btrfs shutdown will block at that point in ctree_close().
> It won't go away under you.

Could you explain the race condition? I think the deadlock won't happen:
during the btrfs shutdown we hold s_umount, so the write operation will
fail to lock it and quit quickly, and then umount will continue.

I think sb_want_write() is similar to trylock(s_umount); the difference
is that sb_want_write() is more complex.

> 
> Now, you might want to move those sysfs entry removals to the very beginning
> of btrfs_kill_super(), but that's a different story - you need only to make
> sure that they are removed not later than the destruction of the data
> structures they need (IOW, the current location might very well be OK - I
> hadn't checked the details).

Yes, we need to move those sysfs entry removals, but we needn't move them
to the very beginning of btrfs_kill_super(), just to the beginning of
close_ctree().

The current location is not right; it introduces a use-after-free problem,
because we remove the sysfs entry after we stop transaction_kthread, so a
use-after-free can happen in this case:

Task1                                   Task2
                                        change label by sysfs
close_ctree
  kthread_stop(transaction_kthread)
                                        change label
                                        wake_up(transaction_kthread)


Thanks
Miao

> 
> As for "it won't go r/o under us" - sb_want_write() will do that just fine.
> .
> 



Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 10:02:26 +0800, Qu Wenruo wrote:
> 
>  Original Message 
> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
> vfsmount from a given sb.
> From: Qu Wenruo 
> To: Miao Xie , linux-btrfs@vger.kernel.org
> Date: 2015-01-30 09:44
>>
>>  Original Message 
>> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
>> vfsmount from a given sb.
>> From: Miao Xie 
>> To: Qu Wenruo , 
>> Date: 2015-01-30 08:52
>>> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>>>> There are sysfs interfaces in some filesystems (only btrfs so far) that
>>>> will modify on-disk data.
>>>> Unlike the normal file operation routines, where we can use
>>>> mnt_want_write_file() to protect the operation, a change through sysfs
>>>> is not bound to any file in the filesystem.
>>>> So we can only extract the first vfsmount of a superblock and pass it to
>>>> mnt_want_write() to do the protection.
>>> This method is wrong, because one fs may be mounted at multiple places
>>> at the same time, some R/O, some R/W; you may get an R/O one and
>>> fail to get the write permission.
>> This shouldn't happen. If one is ro, the whole fs should be ro, right?
>> You can mount a device that is already mounted rw at another point as ro,
>> and remounting one mount point ro will also turn all other mount points ro.
>>
>> So I don't see the problem here.
>>>
>>> I think you could do the label/feature change through the sysfs interface in the following way:
>>>
>>> btrfs_sysfs_change_()
>>> {
>>>     /* Use trylock to avoid the race with umount */
>>>     if (!mutex_trylock(&sb->s_umount))
>>>         return -EBUSY;
>>>
>>>     check R/O and FREEZE
>>>
>>>     mutex_unlock(&sb->s_umount);
>>> }
>> This looks better since it does not introduce changes to the VFS.
>>
>> Thanks,
>> Qu
> Oh, wait a second, this one leads to the old problem and the old solution.
> 
> If we hold the s_umount mutex, we must do the freeze check and can't start
> a transaction, since it will deadlock.
> 
> And for the freeze check, we must use sb_try_start_intwrite() to hold the
> freeze lock and then add a new btrfs_start_transaction_freeze() which will
> not call sb_start_write()...
> 
> Oh, this seems so similar to the v2 or v3 version of the RFC patch.
> So it still comes down to the old method?

No. Just check R/O and FREEZE; if the check fails, go out. If the check
passes, we start the transaction. Because we do it under the s_umount
lock, no one can change the fs to R/O or FREEZE.

Maybe the above description is not so clear, so let me explain it again.

btrfs_sysfs_change_()
{
    /* Use trylock to avoid the race with umount */
    if (!mutex_trylock(&sb->s_umount))
        return -EBUSY;

    if (fs is R/O or FROZEN) {
        mutex_unlock(&sb->s_umount);
        return -EACCES;
    }

    btrfs_start_transaction()
    change label/feature
    btrfs_commit_transaction()

    mutex_unlock(&sb->s_umount);
}

Thanks
Miao

> 
> Thanks,
> Qu
>>>
>>> Thanks
>>> Miao
>>>
>>>> Cc: linux-fsdevel 
>>>> Signed-off-by: Qu Wenruo 
>>>> ---
>>>>   fs/namespace.c| 25 +
>>>>   include/linux/mount.h |  1 +
>>>>   2 files changed, 26 insertions(+)
>>>>
>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>> index cd1e968..5a16a62 100644
>>>> --- a/fs/namespace.c
>>>> +++ b/fs/namespace.c
>>>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>>>   }
>>>>   EXPORT_SYMBOL(mntget);
>>>>   +/*
>>>> + * get a vfsmount from a given sb
>>>> + *
>>>> + * This is especially used for case where change fs' sysfs interface
>>>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>>>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>>>> + */
>>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>>> +{
>>>> +struct vfsmount *ret_vfs = NULL;
>>>> +struct mount *mnt;
>>>> +int ret = 0;
>>>> +
>>>> +lock_mount_hash();
>>>> +if (list_empty(&sb->s_mounts))
>>>> +goto out;
>>>> +mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);

Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 10:51:52 +0800, Qu Wenruo wrote:
> 
>  Original Message 
> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse 
> mount
> option in a atomic way
> From: Miao Xie 
> To: Qu Wenruo , 
>> Date: 2015-01-30 10:06
>> On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
>>>  Original Message 
>>> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse 
>>> mount
>>> option in a atomic way
>>> From: Miao Xie 
>>> To: Qu Wenruo , 
>>> Date: 2015-01-30 09:29
>>>> On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>>>>>> Here we need ACCESS_ONCE to wrap info->mount_opt, or the compiler might
>>>>>> use info->mount_opt instead of new_opt.
>>>>> Thanks for pointing out this one.
>>>>>> But I worry that this is not the key reason for the wrong space cache.
>>>>>> Could you explain the race condition that caused the wrong space cache?
>>>>>>
>>>>>> Thanks
>>>>>> Miao
>>>>> CPU0:
>>>>> remount()
>>>>> |- sync_fs() <- after sync_fs() we can start new trans
>>>>> |- btrfs_parse_options() CPU1:
>>>>>   |- start_transaction()
>>>>>   |- Do some bg allocation, not recorded in 
>>>>> space_cache.
>>>> I think it is a bug if free space is not recorded in the space cache.
>>>> Could you explain why it is not recorded?
>>>>
>>>> Thanks
>>>> Miao
>>> IIRC, in that window, the SPACE_CACHE bit of fs_info->mount_opt is cleared,
>>> so the space cache is not recorded.
>> SPACE_CACHE is used to control the cache write-out, not the in-memory cache.
>> All the free space should be recorded in the in-memory cache. And when we
>> write out the in-memory space cache, we need to protect it from changing.
>>
>> Thanks
>> Miao
> You're right, the wrong space cache problem is not caused by the non-atomic
> mount option problem.
> But the atomic mount option change, together with per-transaction mount
> options, will at least make it disappear when using the nospace_cache
> mount option.

But we need to fix the problem, not hide it.

Thanks
Miao

> 
> Thanks,
> Qu
>>
>>> Thanks,
>>> Qu
>>>>>|- set SPACE_CACHE bit due to cache_gen
>>>>>
>>>>>   |- commit_transaction()
>>>>>   |- write space cache and update cache_gen.
>>>>>   but since some of it is not recorded in 
>>>>> space
>>>>> cache,
>>>>>   the space cache missing some records.
>>>>>|- clear SPACE_CACHE bit due to nospace_cache
>>>>>
>>>>> So the space cache is wrong.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>>> +}
>>>>>>> kfree(orig);
>>>>>>> return ret;
>>>>>>> }
>>>>>>>
>>>>> .
>>>>>
>>>
> 
> .
> 



Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
> 
>  Original Message 
> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse 
> mount
> option in a atomic way
> From: Miao Xie 
> To: Qu Wenruo , 
>> Date: 2015-01-30 09:29
>> On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>>>> Here we need ACCESS_ONCE to wrap info->mount_opt, or the compiler might
>>>> use info->mount_opt instead of new_opt.
>>> Thanks for pointing out this one.
>>>> But I worry that this is not the key reason for the wrong space cache.
>>>> Could you explain the race condition that caused the wrong space cache?
>>>>
>>>> Thanks
>>>> Miao
>>> CPU0:
>>> remount()
>>> |- sync_fs() <- after sync_fs() we can start new trans
>>> |- btrfs_parse_options() CPU1:
>>>  |- start_transaction()
>>>  |- Do some bg allocation, not recorded in space_cache.
>> I think it is a bug if free space is not recorded in the space cache. Could
>> you explain why it is not recorded?
>>
>> Thanks
>> Miao
> IIRC, in that window, the SPACE_CACHE bit of fs_info->mount_opt is cleared,
> so the space cache is not recorded.

SPACE_CACHE is used to control the cache write-out, not the in-memory cache.
All the free space should be recorded in the in-memory cache. And when we
write out the in-memory space cache, we need to protect it from changing.

Thanks
Miao

> 
> Thanks,
> Qu
>>
>>>   |- set SPACE_CACHE bit due to cache_gen
>>>
>>>  |- commit_transaction()
>>>  |- write space cache and update cache_gen.
>>>  but since some of it is not recorded in space
>>> cache,
>>>  the space cache missing some records.
>>>   |- clear SPACE_CACHE bit due to nospace_cache
>>>
>>> So the space cache is wrong.
>>>
>>> Thanks,
>>> Qu
>>>>> +}
>>>>>kfree(orig);
>>>>>return ret;
>>>>>}
>>>>>
>>> .
>>>
> 
> 



Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>> Here we need ACCESS_ONCE to wrap info->mount_opt, or the compiler might
>> use info->mount_opt instead of new_opt.
> Thanks for pointing out this one.
>>
>> But I worry that this is not the key reason for the wrong space cache.
>> Could you explain the race condition that caused the wrong space cache?
>>
>> Thanks
>> Miao
> CPU0:
> remount()
> |- sync_fs() <- after sync_fs() we can start new trans
> |- btrfs_parse_options() CPU1:
> |- start_transaction()
> |- Do some bg allocation, not recorded in space_cache.

I think it is a bug if free space is not recorded in the space cache. Could
you explain why it is not recorded?

Thanks
Miao

>  |- set SPACE_CACHE bit due to cache_gen
> 
> |- commit_transaction()
> |- write space cache and update cache_gen.
> but since some of it is not recorded in space 
> cache,
> the space cache missing some records.
>  |- clear SPACE_CACHE bit due to nospace_cache
> 
> So the space cache is wrong.
> 
> Thanks,
> Qu
>>
>>> +}
>>>   kfree(orig);
>>>   return ret;
>>>   }
>>>
> 
> .
> 



Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
> There are sysfs interfaces in some filesystems (only btrfs so far) that
> will modify on-disk data.
> Unlike the normal file operation routines, where we can use
> mnt_want_write_file() to protect the operation, a change through sysfs
> is not bound to any file in the filesystem.
> So we can only extract the first vfsmount of a superblock and pass it to
> mnt_want_write() to do the protection.

This method is wrong, because one fs may be mounted at multiple places
at the same time, some R/O, some R/W; you may get an R/O one and
fail to get the write permission.

I think you could do the label/feature change through the sysfs interface in the following way:

btrfs_sysfs_change_()
{
    /* Use trylock to avoid the race with umount */
    if (!mutex_trylock(&sb->s_umount))
        return -EBUSY;

    check R/O and FREEZE

    mutex_unlock(&sb->s_umount);
}

Thanks
Miao

> 
> Cc: linux-fsdevel 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/namespace.c| 25 +
>  include/linux/mount.h |  1 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index cd1e968..5a16a62 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>  }
>  EXPORT_SYMBOL(mntget);
>  
> +/*
> + * get a vfsmount from a given sb
> + *
> + * This is especially used for case where change fs' sysfs interface
> + * will lead to a write, e.g. Change label through sysfs in btrfs.
> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
> + */
> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
> +{
> + struct vfsmount *ret_vfs = NULL;
> + struct mount *mnt;
> + int ret = 0;
> +
> + lock_mount_hash();
> + if (list_empty(&sb->s_mounts))
> + goto out;
> + mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
> + ret_vfs = &mnt->mnt;
> + ret_vfs = mntget(ret_vfs);
> +out:
> + unlock_mount_hash();
> + return ret_vfs;
> +}
> +EXPORT_SYMBOL(get_vfsmount_sb);
> +
>  struct vfsmount *mnt_clone_internal(struct path *path)
>  {
>   struct mount *p;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index c2c561d..cf1b0f5 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>  extern void mntput(struct vfsmount *mnt);
>  extern struct vfsmount *mntget(struct vfsmount *mnt);
>  extern struct vfsmount *mnt_clone_internal(struct path *path);
> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>  extern int __mnt_is_readonly(struct vfsmount *mnt);
>  
>  struct path;
> 



Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way

2015-01-29 Thread Miao Xie
On Thu, 29 Jan 2015 10:24:35 +0800, Qu Wenruo wrote:
> The current btrfs_parse_options() is not atomic: it can set and then clear
> a bit, especially in the nospace_cache case.
> 
> For example, if a fs is mounted with nospace_cache,
> btrfs_parse_options() will first set the SPACE_CACHE bit (since
> cache_generation is non-zero) and then clear the SPACE_CACHE bit due to
> the nospace_cache mount option.
> So under heavy operations, when remounting a nospace_cache btrfs, there
> is a window for a commit to create the space cache.
> 
> This bug can be reproduced by fstests btrfs/071, 073, and 074 with the
> nospace_cache mount option. There is about a 50% chance of creating the
> space cache, and about a 10% chance of creating a wrong space cache,
> which can't pass btrfsck.
> 
> This patch does the mount option parsing in a copy-and-update manner.
> First copy the mount_opt from fs_info to new_opt, and only update
> options in new_opt. At the end, copy new_opt back to
> fs_info->mount_opt.
> 
> This patch is already good enough to fix the above nospace_cache +
> remount bug, but a later patch is needed to make sure mount options do
> not change during a transaction.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/ctree.h |  16 
>  fs/btrfs/super.c | 115 +--
>  2 files changed, 69 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5f99743..26bb47b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2119,18 +2119,18 @@ struct btrfs_ioctl_defrag_range_args {
>  #define btrfs_test_opt(root, opt)((root)->fs_info->mount_opt & \
>BTRFS_MOUNT_##opt)
>  
> -#define btrfs_set_and_info(root, opt, fmt, args...)  \
> +#define btrfs_set_and_info(fs_info, val, opt, fmt, args...)  \
>  {\
> - if (!btrfs_test_opt(root, opt)) \
> - btrfs_info(root->fs_info, fmt, ##args); \
> - btrfs_set_opt(root->fs_info->mount_opt, opt);   \
> + if (!btrfs_raw_test_opt(val, opt))  \
> + btrfs_info(fs_info, fmt, ##args);   \
> + btrfs_set_opt(val, opt);\
>  }
>  
> -#define btrfs_clear_and_info(root, opt, fmt, args...)  \
> +#define btrfs_clear_and_info(fs_info, val, opt, fmt, args...)  \
>  {\
> - if (btrfs_test_opt(root, opt))  \
> - btrfs_info(root->fs_info, fmt, ##args); \
> - btrfs_clear_opt(root->fs_info->mount_opt, opt); \
> + if (btrfs_raw_test_opt(val, opt))   \
> + btrfs_info(fs_info, fmt, ##args);   \
> + btrfs_clear_opt(val, opt);  \
>  }
>  
>  /*
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b0c45b2..490fe1f 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -395,10 +395,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>   int ret = 0;
>   char *compress_type;
>   bool compress_force = false;
> + unsigned long new_opt;
> +
> + new_opt = info->mount_opt;

Here (this read of info->mount_opt), and

>  
>   cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
>   if (cache_gen)
> - btrfs_set_opt(info->mount_opt, SPACE_CACHE);
[SNIP]
>  out:
> - if (!ret && btrfs_test_opt(root, SPACE_CACHE))
> - btrfs_info(root->fs_info, "disk space caching is enabled");
> + if (!ret) {
> + if (btrfs_raw_test_opt(new_opt, SPACE_CACHE))
> + btrfs_info(info, "disk space caching is enabled");
> + info->mount_opt = new_opt;

here (the final store), we need ACCESS_ONCE to wrap info->mount_opt, or the
compiler might use info->mount_opt instead of new_opt.

But I worry that this is not the key reason for the wrong space cache. Could
you explain the race condition that caused the wrong space cache?

Thanks
Miao

> + }
>   kfree(orig);
>   return ret;
>  }
> 
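
The store being discussed would look something like this (a sketch using
the patch's new_opt scheme; ACCESS_ONCE was the kernel idiom of the day,
later superseded by WRITE_ONCE):

/* Publish the fully-built option set with a single volatile store, so
 * the compiler cannot apply the individual set/clear operations to
 * info->mount_opt directly; lockless readers of mount_opt then see
 * either the old or the new option set, never a half-updated one. */
static void btrfs_publish_mount_opt(struct btrfs_fs_info *info,
                                    unsigned long new_opt)
{
    ACCESS_ONCE(info->mount_opt) = new_opt;
}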



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-25 Thread Miao Xie
On Fri, 23 Jan 2015 17:59:49 +0100, David Sterba wrote:
> On Wed, Jan 21, 2015 at 03:04:02PM +0800, Miao Xie wrote:
>>> Pending changes are *not* only mount options. Feature change and label 
>>> change
>>> are also pending changes if using sysfs.
>>
>> My mistake, I didn't notice the feature and label changes via sysfs.
>>
>> But the implementation of the feature and label change via sysfs is wrong;
>> we cannot change them without write permission.
> 
> Label change does not happen if the fs is readonly. If the filesystem is
> RW and label is changed through sysfs, then remount to RO will sync the
> filesystem and the new label will be saved.
> 
> The sysfs features write handler is missing that protection though, I'll
> send a patch.

First, that R/O protection is too weak; there is a race between an R/O
remount and a label/feature change. Please consider the following case:

Remount R/O task                Label/Attr change task
                                check R/O
remount R/O
                                change label/feature

Second, it forgets to handle the freezing event.

> 
>>> For freeze, it's not the same problem since the fs will be unfreeze sooner 
>>> or
>>> later and transaction will be initiated.
>>
You cannot make assumptions about users' operations; they might freeze the
fs and then shut down the machine.
> 
> The semantics of freezing should make the on-device image consistent,
> but still keep some changes in memory.
> 
>>>>> For example, if we change the features/label through sysfs, and then 
>>>>> umount
>>>>> the fs,
>>>> It is different from pending change.
>>> No, now features/label changing using sysfs both use pending changes to do 
>>> the
>>> commit.
>>> See BTRFS_PENDING_COMMIT bit.
>>> So freeze -> change features/label -> sync will still cause the deadlock in 
>>> the
>>> same way,
>>> and you can try it yourself.
>>
>> As I said above, the implementation of the sysfs feature and label change
>> is wrong. It is better to separate them from the pending mount option
>> changes and make the sysfs feature and label changes happen in the context
>> of a transaction, after getting write permission. If so, we needn't do
>> anything special when syncing the fs.
> 
> That would mean to drop the write support of sysfs files that change
> global filesystem state (label and features right now). This would leave
> only the ioctl way to do that. I'd like to keep the sysfs write support
> though for ease of use from scripts and languages not ioctl-friendly.
> .

Not drop the write support of sysfs; just fix the bug and make it change the
label and features in a writable context.

Thanks
Miao


Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-21 Thread Miao Xie
On Wed, 21 Jan 2015 15:47:54 +0800, Qu Wenruo wrote:
>> On Wed, 21 Jan 2015 11:53:34 +0800, Qu Wenruo wrote:
> [snipped]
> This will cause another problem: nobody can ensure there will be a next
> transaction, and the change may never be written to disk.
 First, the pending changes are mount options, which are in-memory data.
 Second, the same problem would happen after you freeze the fs.
>>> Pending changes are *not* only mount options. Feature change and label 
>>> change
>>> are also pending changes if using sysfs.
>> My mistake, I didn't notice the feature and label changes via sysfs.
>>
>> But the implementation of the feature and label change via sysfs is wrong;
>> we cannot change them without write permission.
>>
>>> Normal ioctl label changing is not affected.
>>>
>>> For freeze, it's not the same problem, since the fs will be unfrozen sooner
>>> or later and a transaction will be initiated.
>> You cannot make assumptions about users' operations; they might freeze the
>> fs and then shut down the machine.
>>
> For example, if we change the features/label through sysfs, and then 
> umount
> the fs,
 It is different from pending change.
>>> No, now features/label changing using sysfs both use pending changes to do 
>>> the
>>> commit.
>>> See BTRFS_PENDING_COMMIT bit.
>>> So freeze -> change features/label -> sync will still cause the deadlock in 
>>> the
>>> same way,
>>> and you can try it yourself.
>> As I said above, the implementation of the sysfs feature and label change
>> is wrong. It is better to separate them from the pending mount option
>> changes and make the sysfs feature and label changes happen in the context
>> of a transaction, after getting write permission. If so, we needn't do
>> anything special when syncing the fs.
>>
>> In short, changing the sysfs feature and label change implementation and
>> removing the unnecessary btrfs_start_transaction in sync_fs can fix the
>> deadlock.
> Your method will only fix the deadlock, but it will introduce the risk that
> e.g. a pending inode_cache change will never be written to disk, as I
> already explained (if we are still using the fs_info->pending_changes
> mechanism).
> To ensure pending changes are written to disk, sync_fs() should start a
> transaction if needed, or there is a chance that no transaction will
> handle them.

We are sure that writing down the inode cache needs a transaction. But
INODE_CACHE is not a forcible flag. Even when you set it, it is very likely
that the inode cache files are not created and the data is not written down,
because the fs might still be reading the inode usage information, and that
operation might span several transactions. So I think what you are worried
about is not a problem.

Thanks
Miao

> 
> But I don't see the necessity of deferring current work (inode_cache,
> feature/label changes) to the next transaction.
> 
> To David:
> I'm a little curious about why inode_cache needs to be delayed to next 
> transaction.
> In btrfs_remount() we have s_umount mutex, and we synced the whole filesystem
> already,
> so there should be no running transaction and we can just set any mount option
> into fs_info.
> 
> Or even in the worst case, if there is a racing window, we can still start
> a transaction and do the commit; a little overhead in such a minor case
> won't impact the overall performance.
> 
> For the sysfs change, I prefer the attach-or-start-transaction method, also
> for the mount option change; such sysfs tuning is a minor case for a
> filesystem anyway.
> 
> What do you think about reverting the whole patchset and rework the sysfs
> interface?
> 
> Thanks,
> Qu
>>
>> Thanks
>> Miao
>>
>>> Thanks,
>>> Qu
>>>
 If you want to change features/label, you should get write permission and
 make sure the fs is not frozen, because those are on-disk data. So the
 problem doesn't exist, or there is a bug.

 Thanks
 Miao

> since there is no write, there is no running transaction, and if we don't
> start a new transaction, it won't be flushed to disk.
>
> Thanks,
> Qu
>> the reasons are:
>> - It makes the behavior of the fs consistent (both frozen and unfrozen)
>> - The data on disk is correct and intact
>>
>>
>> Thanks
>> Miao
> .
>
>>> .
>>>
> 
> .
> 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-20 Thread Miao Xie
On Wed, 21 Jan 2015 11:53:34 +0800, Qu Wenruo wrote:
>> +/*
>> + * Test if the fs is frozen, or start_trasaction
>> + * will deadlock on itself.
>> + */
>> +if (__sb_start_write(sb, SB_FREEZE_FS, false))
>> +__sb_end_write(sb, SB_FREEZE_FS);
>> +else
>> +return 0;
>>> But what if someone freezes the FS after __sb_end_write() and before
>>> btrfs_start_transaction()?   I don't see what keeps new freezers from
>>> coming in.
>>>
>>> -chris
>> Either VFS::freeze_super() and VFS::syncfs() will hold the s_umount 
>> mutex, so
>> freeze will not happen
>> during sync.
> You're right.  I was worried about the sync ioctl, but the mutex won't be 
> held
> there to deadlock against.  We'll be fine.
 There is another problem introduced by the pending changes: we will
 start and commit a transaction (via a pending mount option change) after
 we set the fs to R/O.
>>> Oh, I missed this problem.
 I think it is better that we don't start a new transaction for pending
 changes that are set after the transaction is committed; just make them
 be handled by the next transaction.
>>> This will cause another problem: nobody can ensure there will be a next
>>> transaction, and the change may never be written to disk.
>> First, the pending changes are mount options, which are in-memory data.
>> Second, the same problem would happen after you freeze the fs.
> Pending changes are *not* only mount options. Feature change and label change
> are also pending changes if using sysfs.

My mistake, I didn't notice the feature and label changes via sysfs.

But the implementation of the feature and label change via sysfs is wrong;
we cannot change them without write permission.

> Normal ioctl label changing is not affected.
> 
> For freeze, it's not the same problem, since the fs will be unfrozen sooner
> or later and a transaction will be initiated.

You cannot make assumptions about users' operations; they might freeze the
fs and then shut down the machine.

>>
>>> For example, if we change the features/label through sysfs, and then umount
>>> the fs,
>> It is different from pending change.
> No, now features/label changing using sysfs both use pending changes to do the
> commit.
> See BTRFS_PENDING_COMMIT bit.
> So freeze -> change features/label -> sync will still cause the deadlock in 
> the
> same way,
> and you can try it yourself.

As I said above, the implementation of the sysfs feature and label change
is wrong. It is better to separate them from the pending mount option
changes and make the sysfs feature and label changes happen in the context
of a transaction, after getting write permission. If so, we needn't do
anything special when syncing the fs.

In short, changing the sysfs feature and label change implementation and
removing the unnecessary btrfs_start_transaction() in sync_fs() can fix the
deadlock.

Thanks
Miao

> 
> Thanks,
> Qu
> 
>> If you want to change features/label, you should get write permission and
>> make sure the fs is not frozen, because those are on-disk data. So the
>> problem doesn't exist, or there is a bug.
>>
>> Thanks
>> Miao
>>
>>> since there is no write, there is no running transaction, and if we don't
>>> start a new transaction, it won't be flushed to disk.
>>>
>>> Thanks,
>>> Qu
 the reasons are:
 - It makes the behavior of the fs consistent (both frozen and unfrozen)
 - The data on disk is correct and intact


 Thanks
 Miao
>>> .
>>>
> 
> .
> 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-20 Thread Miao Xie
On Wed, 21 Jan 2015 11:15:41 +0800, Qu Wenruo wrote:
> 
>  Original Message 
> Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs 
> to
> avoid deadlock.
> From: Miao Xie 
> To: Chris Mason , Qu Wenruo 
>> Date: 2015-01-21 11:10
>> On Tue, 20 Jan 2015 20:10:56 -0500, Chris Mason wrote:
>>> On Tue, Jan 20, 2015 at 8:09 PM, Qu Wenruo  wrote:
>>>>  Original Message 
>>>> Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen 
>>>> fs
>>>> to avoid deadlock.
>>>> From: Chris Mason 
>>>> To: Qu Wenruo 
>>>> Date: 2015-01-21 09:05
>>>>>
>>>>> On Tue, Jan 20, 2015 at 7:58 PM, Qu Wenruo  
>>>>> wrote:
>>>>>>  Original Message 
>>>>>> Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on 
>>>>>> frozen
>>>>>> fs to avoid deadlock.
>>>>>> From: David Sterba 
>>>>>> To: Qu Wenruo 
>>>>>> Date: 2015-01-21 01:13
>>>>>>> On Mon, Jan 19, 2015 at 03:42:41PM +0800, Qu Wenruo wrote:
>>>>>>>> --- a/fs/btrfs/super.c
>>>>>>>> +++ b/fs/btrfs/super.c
>>>>>>>> @@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
>>>>>>>> */
>>>>>>>>if (fs_info->pending_changes == 0)
>>>>>>>>return 0;
>>>>>>>> +/*
>>>>>>>> + * Test if the fs is frozen, or start_trasaction
>>>>>>>> + * will deadlock on itself.
>>>>>>>> + */
>>>>>>>> +if (__sb_start_write(sb, SB_FREEZE_FS, false))
>>>>>>>> +__sb_end_write(sb, SB_FREEZE_FS);
>>>>>>>> +else
>>>>>>>> +return 0;
>>>>> But what if someone freezes the FS after __sb_end_write() and before
>>>>> btrfs_start_transaction()?   I don't see what keeps new freezers from
>>>>> coming in.
>>>>>
>>>>> -chris
>>>> Either VFS::freeze_super() and VFS::syncfs() will hold the s_umount mutex, 
>>>> so
>>>> freeze will not happen
>>>> during sync.
>>> You're right.  I was worried about the sync ioctl, but the mutex won't be 
>>> held
>>> there to deadlock against.  We'll be fine.
>> There is another problem introduced by the pending changes: we will start
>> and commit a transaction (via a pending mount option change) after we set
>> the fs to R/O.
>> the fs to be R/O.
> Oh, I missed this problem.
>>
>> I think it is better that we don't start a new transaction for pending
>> changes that are set after the transaction is committed; just make them
>> be handled by the next transaction.
> This will cause another problem: nobody can ensure there will be a next
> transaction, and the change may never be written to disk.

First, the pending changes are mount options, which are in-memory data.
Second, the same problem would happen after you freeze the fs.

> 
> For example, if we change the features/label through sysfs, and then umount 
> the fs,

It is different from the pending changes.
If you want to change features/label, you should get write permission and
make sure the fs is not frozen, because those are on-disk data. So the
problem doesn't exist, or there is a bug.

Thanks
Miao

> since there is no write, there is no running transaction, and if we don't
> start a new transaction, it won't be flushed to disk.
> 
> Thanks,
> Qu
>> the reasons are:
>> - It makes the behavior of the fs consistent (both frozen and unfrozen)
>> - The data on disk is correct and intact
>>
>>
>> Thanks
>> Miao
> 
> .
> 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-20 Thread Miao Xie
On Tue, 20 Jan 2015 20:10:56 -0500, Chris Mason wrote:
> On Tue, Jan 20, 2015 at 8:09 PM, Qu Wenruo  wrote:
>>
>>  Original Message 
>> Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs
>> to avoid deadlock.
>> From: Chris Mason 
>> To: Qu Wenruo 
>> Date: 2015-01-21 09:05
>>>
>>>
>>> On Tue, Jan 20, 2015 at 7:58 PM, Qu Wenruo  wrote:

  Original Message 
 Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen
 fs to avoid deadlock.
 From: David Sterba 
 To: Qu Wenruo 
 Date: 2015-01-21 01:13
> On Mon, Jan 19, 2015 at 03:42:41PM +0800, Qu Wenruo wrote:
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
>>*/
>>   if (fs_info->pending_changes == 0)
>>   return 0;
>> +/*
>> + * Test if the fs is frozen, or start_trasaction
>> + * will deadlock on itself.
>> + */
>> +if (__sb_start_write(sb, SB_FREEZE_FS, false))
>> +__sb_end_write(sb, SB_FREEZE_FS);
>> +else
>> +return 0;
>>>
>>> But what if someone freezes the FS after __sb_end_write() and before
>>> btrfs_start_transaction()?   I don't see what keeps new freezers from 
>>> coming in.
>>>
>>> -chris
>> Either VFS::freeze_super() and VFS::syncfs() will hold the s_umount mutex, so
>> freeze will not happen
>> during sync.
> 
> You're right.  I was worried about the sync ioctl, but the mutex won't be held
> there to deadlock against.  We'll be fine.

There is another problem introduced by the pending changes: we will start
and commit a transaction (via a pending mount option change) after we set
the fs to R/O.

I think it is better that we don't start a new transaction for pending
changes that are set after the transaction is committed; just make them be
handled by the next transaction. The reasons are:
- It makes the behavior of the fs consistent (both frozen and unfrozen)
- The data on disk is correct and intact


Thanks
Miao


Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-20 Thread Miao Xie
On Tue, 20 Jan 2015 11:17:07 +0800, Qu Wenruo wrote:
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
>> */
>>if (fs_info->pending_changes == 0)
>>return 0;
>> +/*
>> + * Test if the fs is frozen, or start_trasaction
>> + * will deadlock on itself.
>> + */
>> +if (__sb_start_write(sb, SB_FREEZE_FS, false))
>> +__sb_end_write(sb, SB_FREEZE_FS);
>> +else
>> +return 0;
> I'm not sure this is the right fix. We should use either
> mnt_want_write_file or sb_start_write around the start/commit functions.
> The fs may be frozen already, but we also have to catch transition to
> that state, or RO remount.
 But the deadlock between s_umount and frozen level is a larger problem...

 Even Miao mentioned that we can start a transaction in btrfs_freeze(), but
 there is still the possibility that we try to change a feature of the
 frozen btrfs and do a sync, and again the deadlock will happen.
 Although handling it in btrfs_freeze() is also needed, that can't resolve
 the whole problem.

 IMHO the fix is still needed, at least as a workaround until we find a
 real root solution for it
 (if nobody wants to revert the patchset).

 BTW, what about putting the pending changes on a workqueue, so we don't
 start a transaction under the s_umount context like sync_fs()?
>> I don't like this fix.
>> I think we should deal with those pending changes when we freeze a
>> filesystem, or we break the rules of fs freeze.
> I am afraid handling it in btrfs_freeze() won't help.
> A case like freeze() -> change_feature -> sync() -> unfreeze() will still
> deadlock in sync().

We should not change features after the fs is frozen.

> Even if we clear the pending changes in freeze(), they can still be set
> through the sysfs interface while the fs is frozen.
> 
> And in fact, if we put things like attaching/creating a transaction into a
> workqueue, we will not break the freeze rule.
> If the fs is frozen, there is no running transaction and we need to create
> a new one; that will call sb_start_intwrite(), which will sleep until the
> fs is unfrozen.

I just read the pending change code, and I found that the pending changes
are only used for changing mount options now. So I think that, as a
workaround fix, we needn't start a new transaction to handle pending flags
that are set after the current transaction is committed, because the data
on the disk is intact.

Thanks
Miao


> 
> Thanks,
> Qu
>>
>> Thanks
>> Miao
>>
 Thanks,
 Qu
> Also, returning 0 is not right, the ioctl actually skipped the expected
> work.
>
>>trans = btrfs_start_transaction(root, 0);
>>} else {
>>return PTR_ERR(trans);
>>> .
>>>
> 
> .
> 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-19 Thread Miao Xie
On Tue, 20 Jan 2015 10:53:05 +0800, Qu Wenruo wrote:
> Add CC to Miao Xie 
> 
>  Original Message 
> Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs 
> to
> avoid deadlock.
> From: Qu Wenruo 
> To: dste...@suse.cz, linux-btrfs@vger.kernel.org, Miao Xie 
> 
> Date: 2015-01-20 10:51
>>
>>  Original Message 
>> Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs
>> to avoid deadlock.
>> From: David Sterba 
>> To: Qu Wenruo 
>> Date: 2015年01月19日 22:06
>>> On Mon, Jan 19, 2015 at 03:42:41PM +0800, Qu Wenruo wrote:
>>>> The fix is to check if the fs is frozen; if it is, just return and wait
>>>> for the next transaction.
>>>>
>>>> --- a/fs/btrfs/super.c
>>>> +++ b/fs/btrfs/super.c
>>>> @@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
>>>>*/
>>>>   if (fs_info->pending_changes == 0)
>>>>   return 0;
>>>> +/*
>>>> + * Test if the fs is frozen, or start_trasaction
>>>> + * will deadlock on itself.
>>>> + */
>>>> +if (__sb_start_write(sb, SB_FREEZE_FS, false))
>>>> +__sb_end_write(sb, SB_FREEZE_FS);
>>>> +else
>>>> +return 0;
>>> I'm not sure this is the right fix. We should use either
>>> mnt_want_write_file or sb_start_write around the start/commit functions.
>>> The fs may be frozen already, but we also have to catch transition to
>>> that state, or RO remount.
>> But the deadlock between s_umount and frozen level is a larger problem...
>>
>> Even Miao mentioned that we can start a transaction in btrfs_freeze(), but
>> there is still the possibility that we try to change a feature of the
>> frozen btrfs and do a sync, and again the deadlock will happen.
>> Although handling it in btrfs_freeze() is also needed, that can't resolve
>> the whole problem.
>>
>> IMHO the fix is still needed, at least as a workaround until we find a
>> real root solution for it
>> (if nobody wants to revert the patchset).
>>
>> BTW, what about putting the pending changes on a workqueue, so we don't
>> start a transaction under the s_umount context like sync_fs()?

I don't like this fix.
I think we should deal with those pending changes when we freeze a
filesystem, or we break the rules of fs freeze.

Thanks
Miao

>>
>> Thanks,
>> Qu
>>>
>>> Also, returning 0 is not right, the ioctl actually skipped the expected
>>> work.
>>>
>>>>   trans = btrfs_start_transaction(root, 0);
>>>>   } else {
>>>>   return PTR_ERR(trans);
>>
> 
> .
> 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-19 Thread Miao Xie
On Mon, 19 Jan 2015 15:42:41 +0800, Qu Wenruo wrote:
> Commit 6b5fe46dfa52 (btrfs: do commit in sync_fs if there are pending
> changes) will call btrfs_start_transaction() in sync_fs(), to handle
> some operations that need to be done in the next transaction.
> 
> However, this can cause a deadlock if the filesystem is frozen, with the
> following sysrq-w output:
> [  143.255932] Call Trace:
> [  143.255936]  [] schedule+0x29/0x70
> [  143.255939]  [] __sb_start_write+0xb3/0x100
> [  143.255971]  [] start_transaction+0x2e6/0x5a0
> [btrfs]
> [  143.255992]  [] btrfs_start_transaction+0x1b/0x20
> [btrfs]
> [  143.256003]  [] btrfs_sync_fs+0xca/0xd0 [btrfs]
> [  143.256007]  [] sync_fs_one_sb+0x20/0x30
> [  143.256011]  [] iterate_supers+0xe1/0xf0
> [  143.256014]  [] sys_sync+0x55/0x90
> [  143.256017]  [] system_call_fastpath+0x12/0x17
> [  143.256111] Call Trace:
> [  143.256114]  [] schedule+0x29/0x70
> [  143.256119]  [] rwsem_down_write_failed+0x1c5/0x2d0
> [  143.256123]  [] call_rwsem_down_write_failed+0x13/0x20
> [  143.256131]  [] thaw_super+0x28/0xc0
> [  143.256135]  [] do_vfs_ioctl+0x3f5/0x540
> [  143.256187]  [] SyS_ioctl+0x91/0xb0
> [  143.256213]  [] system_call_fastpath+0x12/0x17
> 
> The reason is the following:
> (holding s_umount)
> VFS sync_fs path:
> |- btrfs_sync_fs()
>    |- btrfs_start_transaction()
>       |- sb_start_intwrite()
>          (waiting for thaw_fs to unfreeze)
>                                     VFS thaw_fs path:
>                                     thaw_fs()
>                                     (waiting for sync_fs to release
>                                      s_umount)
> 
> So the deadlock happens.
> This can be easily triggered by fstests generic/068 with the inode_cache
> mount option.
> 
> The fix is to check if the fs is frozen; if it is, just return and wait
> for the next transaction.
> 
> Cc: David Sterba 
> Reported-by: Gui Hecheng 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/super.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 60f7cbe..1d9f1e6 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
>*/
>   if (fs_info->pending_changes == 0)
>   return 0;


I think the problem is here -- why is ->pending_changes not 0 when the
filesystem is frozen? The reason for this problem is that btrfs_freeze()
forgets to deal with the pending changes, and the correct fix is to correct
the behavior of btrfs_freeze().

Thanks
Miao

> + /*
> +  * Test if the fs is frozen, or start_trasaction
> +  * will deadlock on itself.
> +  */
> + if (__sb_start_write(sb, SB_FREEZE_FS, false))
> + __sb_end_write(sb, SB_FREEZE_FS);
> + else
> + return 0;
>   trans = btrfs_start_transaction(root, 0);
>   } else {
>   return PTR_ERR(trans);
> 
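
A note on the trylock above: __sb_start_write(sb, SB_FREEZE_FS, false)
returns false instead of blocking when that freeze level is held, and the
reference it takes is dropped again right away. The test-then-start
sequence is presumably only race-free because sync_fs runs with s_umount
held and freeze_super()/thaw_super() also take s_umount, so the freeze
state cannot change between the check and btrfs_start_transaction().
Condensed:

    if (__sb_start_write(sb, SB_FREEZE_FS, false))
            __sb_end_write(sb, SB_FREEZE_FS);  /* not frozen; safe under s_umount */
    else
            return 0;                          /* frozen: leave it to the next commit */
    trans = btrfs_start_transaction(root, 0);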



Re: [PATCH] Btrfs: fix typo of variable in scrub_stripe

2015-01-09 Thread Miao Xie
On Fri, 09 Jan 2015 17:37:52 +0900, Tsutomu Itoh wrote:
> The address that should be freed is not 'ppath' but 'path'.
> 
> Signed-off-by: Tsutomu Itoh  
> ---
>  fs/btrfs/scrub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f2bb13a..403fbdb 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3053,7 +3053,7 @@ static noinline_for_stack int scrub_stripe(struct 
> scrub_ctx *sctx,
>  
>   ppath = btrfs_alloc_path();
>   if (!ppath) {
> - btrfs_free_path(ppath);
> +         btrfs_free_path(path);

My bad. Thanks for fixing it.

Reviewed-by: Miao Xie 

>   return -ENOMEM;
>   }
>  
> 



Re: [PATCH] btrfs: fix raid56 scrub failed in xfstests btrfs/072

2015-01-08 Thread Miao Xie
On Fri, 09 Jan 2015 09:39:40 +0800, Gui Hecheng wrote:
> The xfstests btrfs/072 reports uncorrectable read errors in dmesg,
> because scrub forgets to use the commit_root for the parity scrub
> routine, so scrub attempts to scrub extent items whose contents are
> not fully on disk.
> 
> To fix it, we just add the @search_commit_root flag back.

Reviewed-by: Miao Xie 

> 
> Signed-off-by: Gui Hecheng 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/scrub.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f2bb13a..aa8ff75 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3065,6 +3065,8 @@ static noinline_for_stack int scrub_stripe(struct 
> scrub_ctx *sctx,
>   path->search_commit_root = 1;
>   path->skip_locking = 1;
>  
> + ppath->search_commit_root = 1;
> + ppath->skip_locking = 1;
>   /*
>* trigger the readahead for extent tree csum tree and wait for
>* completion. During readahead, the scrub is officially paused
> 
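
The flag in question lives on the search path. A minimal sketch of the
pattern (the helper name is hypothetical, the fields and calls are the
stock btrfs ones):

    static int lookup_in_commit_root(struct btrfs_root *root,
                                     struct btrfs_key *key)
    {
            struct btrfs_path *path;
            int ret;

            path = btrfs_alloc_path();
            if (!path)
                    return -ENOMEM;
            /* walk the last committed root, so no tree locks are needed */
            path->search_commit_root = 1;
            path->skip_locking = 1;
            ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
            btrfs_free_path(path);
            return ret;
    }

Searching the commit root is what guarantees scrub only sees extent items
whose referenced contents are already on disk.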



Re: [PATCH] btrfs: delete chunk allocation attempt when setting block group ro

2015-01-08 Thread Miao Xie
On Thu, 08 Jan 2015 18:06:50 -0800, Shaohua Li wrote:
> On Fri, Jan 09, 2015 at 09:01:57AM +0800, Miao Xie wrote:
>> On Thu, 08 Jan 2015 13:23:13 -0800, Shaohua Li wrote:
>>> Below test will fail currently:
>>>   mkfs.ext4 -F /dev/sda
>>>   btrfs-convert /dev/sda
>>>   mount /dev/sda /mnt
>>>   btrfs device add -f /dev/sdb /mnt
>>>   btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt
>>>
>>> The reason is that there are some block groups with usage 0, but the whole
>>> disk has no free space to allocate a new chunk, so we can't even set such
>>> a block group read-only. This patch deletes the chunk allocation when
>>> setting a block group ro. For META, we already have a reserve. But for
>>> SYSTEM we don't, so the check_system_chunk call is still required.
>>>
>>> Signed-off-by: Shaohua Li 
>>> ---
>>>  fs/btrfs/extent-tree.c | 31 +++
>>>  1 file changed, 7 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index a80b971..430101b6 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -8493,22 +8493,8 @@ static int set_block_group_ro(struct 
>>> btrfs_block_group_cache *cache, int force)
>>>  {
>>> struct btrfs_space_info *sinfo = cache->space_info;
>>> u64 num_bytes;
>>> -   u64 min_allocable_bytes;
>>> int ret = -ENOSPC;
>>>  
>>> -
>>> -   /*
>>> -* We need some metadata space and system metadata space for
>>> -* allocating chunks in some corner cases until we force to set
>>> -* it to be readonly.
>>> -*/
>>> -   if ((sinfo->flags &
>>> -(BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
>>> -   !force)
>>> -   min_allocable_bytes = 1 * 1024 * 1024;
>>> -   else
>>> -   min_allocable_bytes = 0;
>>> -
>>> spin_lock(&sinfo->lock);
>>> spin_lock(&cache->lock);
>>>  
[SNIP]
>>> ret = set_block_group_ro(cache, 0);
>>> if (!ret)
>>> goto out;
>>> @@ -8566,6 +8544,11 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
>>> goto out;
>>> ret = set_block_group_ro(cache, 0);
>>>  out:
>>> +   if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
>>> +   alloc_flags = update_block_group_flags(root, cache->flags);
>>> +   check_system_chunk(trans, root, alloc_flags);
>>
>> Please consider the case that the following patch fixed:
>>   199c36eaa95077a47ae1bc55532fc0fbeb80cc95
>>
>> If there is no free device space, check_system_chunk cannot allocate a
>> new system metadata chunk, so when we run the final step of the chunk
>> allocation to update the device item and insert the new chunk item, we
>> would fail.
> 
> So the relocation will always fail in this case. The check just makes
> the failure earlier, right? We don't have the BUG_ON in
> do_chunk_alloc() currently.

The final step of the chunk allocation is a delayed operation, so we must
make sure it can be done successfully, or we would abort the transaction,
make the filesystem readonly, and lose the data that was written into the
filesystem before we ran the balance; that would make users uncomfortable.

With this patch, we will set the block group read-only successfully the
first time we invoke set_block_group_ro(). But if the block group that will
be set to RO is the only system metadata block group in the filesystem, and
there is no device space to allocate a new one, then we have no space to
deal with the pending final step of the chunk allocation, so the problem I
described above will happen.

Thanks
Miao


Re: [PATCH] btrfs: delete chunk allocation attempt when setting block group ro

2015-01-08 Thread Miao Xie
On Thu, 08 Jan 2015 13:23:13 -0800, Shaohua Li wrote:
> Below test will fail currently:
>   mkfs.ext4 -F /dev/sda
>   btrfs-convert /dev/sda
>   mount /dev/sda /mnt
>   btrfs device add -f /dev/sdb /mnt
>   btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt
> 
> The reason is that there are some block groups with usage 0, but the whole
> disk has no free space to allocate a new chunk, so we can't even set such
> a block group read-only. This patch deletes the chunk allocation when
> setting a block group ro. For META, we already have a reserve. But for
> SYSTEM we don't, so the check_system_chunk call is still required.
> 
> Signed-off-by: Shaohua Li 
> ---
>  fs/btrfs/extent-tree.c | 31 +++
>  1 file changed, 7 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a80b971..430101b6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8493,22 +8493,8 @@ static int set_block_group_ro(struct 
> btrfs_block_group_cache *cache, int force)
>  {
>   struct btrfs_space_info *sinfo = cache->space_info;
>   u64 num_bytes;
> - u64 min_allocable_bytes;
>   int ret = -ENOSPC;
>  
> -
> - /*
> -  * We need some metadata space and system metadata space for
> -  * allocating chunks in some corner cases until we force to set
> -  * it to be readonly.
> -  */
> - if ((sinfo->flags &
> -  (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
> - !force)
> - min_allocable_bytes = 1 * 1024 * 1024;
> - else
> - min_allocable_bytes = 0;
> -
>   spin_lock(&sinfo->lock);
>   spin_lock(&cache->lock);
>  
> @@ -8521,8 +8507,8 @@ static int set_block_group_ro(struct 
> btrfs_block_group_cache *cache, int force)
>   cache->bytes_super - btrfs_block_group_used(&cache->item);
>  
>   if (sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned +
> - sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes +
> - min_allocable_bytes <= sinfo->total_bytes) {
> + sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes
> + <= sinfo->total_bytes) {
>   sinfo->bytes_readonly += num_bytes;
>   cache->ro = 1;
>   list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
> @@ -8548,14 +8534,6 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
>   if (IS_ERR(trans))
>   return PTR_ERR(trans);
>  
> - alloc_flags = update_block_group_flags(root, cache->flags);
> - if (alloc_flags != cache->flags) {
> - ret = do_chunk_alloc(trans, root, alloc_flags,
> -  CHUNK_ALLOC_FORCE);
> - if (ret < 0)
> - goto out;
> - }
> -
>   ret = set_block_group_ro(cache, 0);
>   if (!ret)
>   goto out;
> @@ -8566,6 +8544,11 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
>   goto out;
>   ret = set_block_group_ro(cache, 0);
>  out:
> + if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
> + alloc_flags = update_block_group_flags(root, cache->flags);
> + check_system_chunk(trans, root, alloc_flags);

Please consider the case that the following patch fixed:
  199c36eaa95077a47ae1bc55532fc0fbeb80cc95

If there is no free device space, check_system_chunk cannot allocate a
new system metadata chunk, so when we run the final step of the chunk
allocation to update the device item and insert the new chunk item, we
would fail.

Thanks
Miao

> + }
> +
>   btrfs_end_transaction(trans, root);
>   return ret;
>  }
> 



Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote:
> On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie  wrote:
>> This patchset implements the device scrub/replace function for RAID56.
>> Most of the implementation for the common data is similar to the other
>> RAID types; the difference, and the difficulty, is the parity processing.
>> The basic idea is to read and check the data which has a checksum outside
>> of the raid56 stripe lock. If the data is right, we lock the raid56
>> stripe and read out the other data in the same stripe; if no IO error
>> happens, we calculate the parity and check the original one. If the
>> original parity is right, the parity scrub passes; otherwise we write
>> out the new one. But if the common data (not parity) that we read out is
>> wrong, we will try to recover it, and then check and repair the parity.
>>
>> And in order to avoid making the code more and more complex, we copied
>> some code of the common data processing for the parity; the cleanup work
>> is on my TODO list.
>>
>> We have done some tests and the patchset worked well. Of course, more
>> tests are welcome. If you are interested in using or testing it, you can
>> pull the patchset from
>>
>>   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
>>
>> Changelog v3 -> v4:
>> - Fix the problem that the scrub's raid bio was cached, which was reported
>>   by Chris.
>> - Remove the 10th patch; the deadlock that was described in that patch
>>   doesn't exist on the current kernel.
>> - Rebase the patchset to the top of integration branch
> 
> Thanks, I'll try this today.  I need to rebase in a new version of the RCU 
> patches, can you please cook one on top of v3.18-rc6 instead?

I have updated my raid56-scrub-replace branch, please re-pull it.
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

The v4 patchset in the mailing list can be applied on v3.18-rc6 successfully,
so I didn't update it.

Thanks
Miao


Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote:
> 
> 
> On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie  wrote:
>> This patchset implements the device scrub/replace function for RAID56.
>> Most of the implementation for the common data is similar to the other
>> RAID types; the difference, and the difficulty, is the parity processing.
>> The basic idea is to read and check the data which has a checksum outside
>> of the raid56 stripe lock. If the data is right, we lock the raid56
>> stripe and read out the other data in the same stripe; if no IO error
>> happens, we calculate the parity and check the original one. If the
>> original parity is right, the parity scrub passes; otherwise we write
>> out the new one. But if the common data (not parity) that we read out is
>> wrong, we will try to recover it, and then check and repair the parity.
>>
>> And in order to avoid making the code more and more complex, we copied
>> some code of the common data processing for the parity; the cleanup work
>> is on my TODO list.
>>
>> We have done some tests and the patchset worked well. Of course, more
>> tests are welcome. If you are interested in using or testing it, you can
>> pull the patchset from
>>
>>   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
>>
>> Changelog v3 -> v4:
>> - Fix the problem that the scrub's raid bio was cached, which was reported
>>   by Chris.
>> - Remove the 10th patch; the deadlock that was described in that patch
>>   doesn't exist on the current kernel.
>> - Rebase the patchset to the top of integration branch
> 
> Thanks, I'll try this today.  I need to rebase in a new version of the RCU 
> patches, can you please cook one on top of v3.18-rc6 instead?

No problem.

Thanks
Miao

> 
> -chris
> 
> 



Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-12-02 Thread Miao Xie
hi, Chris

On Fri, 28 Nov 2014 16:32:03 -0500, Chris Mason wrote:
> On Wed, Nov 26, 2014 at 10:00 PM, Miao Xie  wrote:
>> On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
>>>  On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
>>>>  On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie  wrote:
>>>>>  The increase/decrease of bio counter is on the I/O path, so we should
>>>>>  use io_schedule() instead of schedule(), or the deadlock might be
>>>>>  triggered by the pending I/O in the plug list. io_schedule() can help
>>>>>  us because it will flush all the pending I/O before the task is going
>>>>>  to sleep.
>>>>
>>>>  Can you please describe this deadlock in more detail?  schedule() also
>>>>  triggers a flush of the plug list, and if that's no longer sufficient
>>>>  we can run into other problems (especially with preemption on).
>>>
>>>  Sorry, my mistake. I forgot to check the current implementation of
>>> schedule(), which flushes the plug list unconditionally. Please ignore
>>> this patch.
>>
>> I have updated my raid56-scrub-replace branch, please re-pull the branch.
>>
>>   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
> 
> Sorry, I wasn't clear.  I do like the patch because it uses a slightly better 
> trigger mechanism for the flush.  I was just worried about a larger deadlock.
> 
> I ran the raid56 work with stress.sh overnight, then scrubbed the resulting 
> filesystem and ran balance when the scrub completed.  All of these passed 
> without errors (excellent!).
> 
> Then I zero'd 4GB of one drive and ran scrub again.  This was the result.  
> Please make sure CONFIG_DEBUG_PAGEALLOC is enabled and you should be able to 
> reproduce.

I sent out the 4th version of the patchset, please try it.

I have pushed the new patchset to my git tree, you can re-pull it.
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

> 
> [192392.495260] BUG: unable to handle kernel paging request at 
> 880303062f80
> [192392.495279] IP: [] lock_stripe_add+0xba/0x390 [btrfs]
> [192392.495281] PGD 2bdb067 PUD 107e7fd067 PMD 107e7e4067 PTE 800303062060
> [192392.495283] Oops:  [#1] SMP DEBUG_PAGEALLOC
> [192392.495307] Modules linked in: ipmi_devintf loop fuse k10temp coretemp 
> hwmon btrfs raid6_pq zlib_deflate lzo_compress xor xfs exportfs libcrc32c 
> tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables xt_NFLOG nfnetlink_log 
> nfnetlink xt_comment xt_statistic iptable_filter ip_tables x_tables mptctl 
> netconsole autofs4 nfsv3 nfs lockd grace rpcsec_gss_krb5 auth_rpcgss 
> oid_registry sunrpc ipv6 ext3 jbd dm_mod rtc_cmos ipmi_si ipmi_msghandler 
> iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 lpc_ich mfd_core shpchp ehci_pci 
> ehci_hcd mlx4_en ptp pps_core mlx4_core sg ses enclosure button megaraid_sas
> [192392.495310] CPU: 0 PID: 11992 Comm: kworker/u65:2 Not tainted 
> 3.18.0-rc6-mason+ #7
> [192392.495310] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 
> 1.07 05/10/2012
> [192392.495323] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
> [192392.495324] task: 88013dae9110 ti: 8802296a task.ti: 
> 8802296a
> [192392.495335] RIP: 0010:[]  [] 
> lock_stripe_add+0xba/0x390 [btrfs]
> [192392.495335] RSP: 0018:8802296a3ac8  EFLAGS: 00010006
> [192392.495336] RAX: 880577e85018 RBX: 880497f0b2f8 RCX: 
> 8801190fb000
> [192392.495337] RDX: 013d RSI: 880303062f80 RDI: 
> 040c275a
> [192392.495338] RBP: 8802296a3b48 R08: 880497f0 R09: 
> 0001
> [192392.495339] R10:  R11:  R12: 
> 0282
> [192392.495339] R13: b250 R14: 880577e85000 R15: 
> 880497f0b2a0
> [192392.495340] FS:  () GS:88085fc0() 
> knlGS:
> [192392.495341] CS:  0010 DS:  ES:  CR0: 80050033
> [192392.495342] CR2: 880303062f80 CR3: 05289000 CR4: 
> 000407f0
> [192392.495342] Stack:
> [192392.495344]  880755e28000 880497f0 013d 
> 8801190fb000
> [192392.495346]   88013dae9110 81090d40 
> 8802296a3b00
> [192392.495347]  8802296a3b00 0010 8802296a3b68 
> 8801190fb000
> [192392.495348] Call Trace:
> [192392.495353]  [] ? bit_waitqueue+0xa0/0xa0
> [192392.495363]  [] 
> raid56_parity_submit_scrub_rbio+0x16/0x30 [btrfs]
> [192392.495372]  [] 
> scrub_parity_check_and_repair+0x15e/0x1e0 [btrfs]
> [192392.495380]  [] scrub_block_put

[PATCH v4 05/10] Btrfs, raid56: use a variant to record the operation type

2014-12-02 Thread Miao Xie
We will introduce a new operation type later. If we keep using an
integer variable as a boolean to record each operation type, we will
have to add a new variable and increase the size of the raid bio
structure, which is not good. With this patch, we define a different
number for each operation, and we can use a single variable to record
the operation type.
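
A condensed view of the result; BTRFS_RBIO_PARITY_SCRUB is the third
value that a later patch in this series adds to the same field, instead
of one more int in struct btrfs_raid_bio:

    enum btrfs_rbio_ops {
            BTRFS_RBIO_WRITE        = 0,
            BTRFS_RBIO_READ_REBUILD = 1,
            BTRFS_RBIO_PARITY_SCRUB = 2,    /* added by patch 06/10 */
    };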

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c954537..4924388 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -69,6 +69,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+   BTRFS_RBIO_WRITE= 0,
+   BTRFS_RBIO_READ_REBUILD = 1,
+};
+
 struct btrfs_raid_bio {
struct btrfs_fs_info *fs_info;
struct btrfs_bio *bbio;
@@ -131,7 +136,7 @@ struct btrfs_raid_bio {
 * differently from a parity rebuild as part of
 * rmw
 */
-   int read_rebuild;
+   enum btrfs_rbio_ops operation;
 
/* first bad stripe */
int faila;
@@ -154,7 +159,6 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
-
atomic_t stripes_pending;
 
atomic_t error;
@@ -590,8 +594,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
return 0;
 
/* reads can't merge with writes */
-   if (last->read_rebuild !=
-   cur->read_rebuild) {
+   if (last->operation != cur->operation) {
return 0;
}
 
@@ -784,9 +787,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
spin_unlock(&rbio->bio_list_lock);
spin_unlock_irqrestore(&h->lock, flags);
 
-   if (next->read_rebuild)
+   if (next->operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else {
+   else if (next->operation == BTRFS_RBIO_WRITE){
steal_rbio(rbio, next);
async_rmw_stripe(next);
}
@@ -1720,6 +1723,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
}
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
+   rbio->operation = BTRFS_RBIO_WRITE;
 
/*
 * don't plug on full rbios, just get them out the door
@@ -1768,7 +1772,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
faila = rbio->faila;
failb = rbio->failb;
 
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
spin_lock_irq(&rbio->bio_list_lock);
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
@@ -1785,7 +1789,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1878,7 +1882,7 @@ pstripe:
 * know they can be trusted.  If this was a read reconstruction,
 * other endio functions will fiddle the uptodate bits
 */
-   if (!rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_WRITE) {
for (i = 0;  i < nr_pages; i++) {
if (faila != -1) {
page = rbio_stripe_page(rbio, faila, i);
@@ -1895,7 +1899,7 @@ pstripe:
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1910,8 +1914,7 @@ cleanup:
kfree(pointers);
 
 cleanup_io:
-
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
if (err == 0 &&
!test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags))
cache_rbio_pages(rbio);
@@ -2050,7 +2053,7 @@ out:
return 0;
 
 cleanup:
-

[PATCH v4 09/10] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56

2014-12-02 Thread Miao Xie
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of a device
replace, but at that time btrfs didn't support device replace
on raid56, so we didn't fix the problem for the raid56 profile.
Now that we have implemented device replace for raid56, we need to
kick that problem out before we enable that function for raid56.

The fix method is very simple: we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.
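
Condensed from the hunks below, the pattern is the usual elevated-reference
one; the counter pins the source device for the whole life of the rbio
(names as in this patch):

    /* before the raid56 io is submitted */
    btrfs_bio_counter_inc_noblocked(root->fs_info);
    rbio->generic_bio_cnt = 1;

    /* in rbio_orig_end_io(), once the whole rbio completes */
    if (rbio->generic_bio_cnt)
            btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);

It is a count rather than a flag because merged rbios accumulate their
victims' counts (see the merge_rbio() hunk below).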

Signed-off-by: Miao Xie 
---
Changelog v3 -> v4:
- None.

Changelog v2 -> v3:
- New patch to fix an unhandled use-after-free problem of the source device
  in the final device replace procedure.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/ctree.h   |  7 ++-
 fs/btrfs/dev-replace.c |  4 ++--
 fs/btrfs/raid56.c  | 41 -
 fs/btrfs/raid56.h  |  4 ++--
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/volumes.c |  7 ++-
 6 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fc73e86..3770f4c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4156,7 +4156,12 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 
devid,
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
+
+static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+{
+   btrfs_bio_counter_sub(fs_info, 1);
+}
 
 /* reada.c */
 struct reada_control {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 91f6b8f..326919b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,9 +928,9 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info 
*fs_info)
percpu_counter_inc(&fs_info->bio_counter);
 }
 
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
-   percpu_counter_dec(&fs_info->bio_counter);
+   percpu_counter_sub(&fs_info->bio_counter, amount);
 
if (waitqueue_active(&fs_info->replace_wait))
wake_up(&fs_info->replace_wait);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7e6f239..44573bf 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,6 +162,8 @@ struct btrfs_raid_bio {
 */
int bio_list_bytes;
 
+   int generic_bio_cnt;
+
atomic_t refs;
 
atomic_t stripes_pending;
@@ -354,6 +356,7 @@ static void merge_rbio(struct btrfs_raid_bio *dest,
 {
bio_list_merge(&dest->bio_list, &victim->bio_list);
dest->bio_list_bytes += victim->bio_list_bytes;
+   dest->generic_bio_cnt += victim->generic_bio_cnt;
bio_list_init(&victim->bio_list);
 }
 
@@ -891,6 +894,10 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, 
int err, int uptodate)
 {
struct bio *cur = bio_list_get(&rbio->bio_list);
struct bio *next;
+
+   if (rbio->generic_bio_cnt)
+   btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
+
free_raid_bio(rbio);
 
while (cur) {
@@ -1775,6 +1782,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct btrfs_raid_bio *rbio;
struct btrfs_plug_cb *plug = NULL;
struct blk_plug_cb *cb;
+   int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
if (IS_ERR(rbio)) {
@@ -1785,12 +1793,19 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
rbio->bio_list_bytes = bio->bi_iter.bi_size;
rbio->operation = BTRFS_RBIO_WRITE;
 
+   btrfs_bio_counter_inc_noblocked(root->fs_info);
+   rbio->generic_bio_cnt = 1;
+
/*
 * don't plug on full rbios, just get them out the door
 * as quickly as we can
 */
-   if (rbio_is_full(rbio))
-   return full_stripe_write(rbio);
+   if (rbio_is_full(rbio)) {
+   ret = full_stripe_write(rbio);
+   if (ret)
+   btrfs_bio_counter_dec(root->fs_info);
+   return ret;
+   }
 
cb = blk_check_plugged(btrfs_raid_unplug, root->fs_info,
   sizeof(*plug));
@@ -1801,10 +1816,13 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
INIT_LIST_HEAD(&plug->rbio_list);
}
list_add_tail(&rbio->plug_list, &plug->rbio_list);
+   ret = 0;
} else {
-   return __raid56_parity_write(rbio);
+ 

[PATCH v4 02/10] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-12-02 Thread Miao Xie
From: Zhao Lei 

stripe_index's value was set again in a later line:
stripe_index = 0;

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
Reviewed-by: David Sterba 
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6f80aef..eeb5b31 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5172,9 +5172,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
 
/* push stripe_nr back to the start of the full stripe 
*/
stripe_nr = raid56_full_stripe_start;
-   do_div(stripe_nr, stripe_len);
-
-   stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+   do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map->num_stripes;
-- 
1.9.3



[PATCH v4 07/10] Btrfs, replace: write dirty pages into the replace target device

2014-12-02 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in the btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, all the other operations
  don't take the target device stripes into account.
- When we do a write operation, we read the data from the common devices
  and calculate the parity. Then we write the dirty data and the new
  parity out; at this time, we find the relative replace target stripes
  and write the relative data into them (see the sketch after the note
  below).

Note: The function that copies the old data on the source device to
the target device was implemented in the past; it is similar to
the other RAID types.
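
A condensed sketch of that last step (not the exact hunk; stripe, page
and pagenr are assumed to come from the surrounding write-out loop):

    /*
     * bbio->tgtdev_map[i] holds the index of the target stripe that
     * mirrors source stripe i, or 0 if stripe i is not being replaced.
     */
    if (bbio->num_tgtdevs && bbio->tgtdev_map[stripe]) {
            ret = rbio_add_io_page(rbio, &bio_list, page,
                                   bbio->tgtdev_map[stripe],
                                   pagenr, rbio->stripe_len);
            if (ret)
                    goto cleanup;
    }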

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 58a8408..16fe456 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -131,6 +131,8 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int real_stripes;
+
int stripe_npages;
/*
 * set if we're doing a parity rebuild
@@ -638,7 +640,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio 
*rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-   if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+   if (rbio->nr_data + 1 == rbio->real_stripes)
return NULL;
 
index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
@@ -981,7 +983,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 {
struct btrfs_raid_bio *rbio;
int nr_data = 0;
-   int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+   int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+   int num_pages = rbio_nr_pages(stripe_len, real_stripes);
int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
@@ -1001,6 +1004,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio->fs_info = root->fs_info;
rbio->stripe_len = stripe_len;
rbio->nr_pages = num_pages;
+   rbio->real_stripes = real_stripes;
rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
@@ -1017,10 +1021,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio->bio_pages = p + sizeof(struct page *) * num_pages;
rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-   if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-   nr_data = bbio->num_stripes - 2;
+   if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+   nr_data = real_stripes - 2;
else
-   nr_data = bbio->num_stripes - 1;
+   nr_data = real_stripes - 1;
 
rbio->nr_data = nr_data;
return rbio;
@@ -1132,7 +1136,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
if (rbio->faila >= 0 || rbio->failb >= 0) {
-   BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+   BUG_ON(rbio->faila == rbio->real_stripes - 1);
__raid56_parity_recover(rbio);
} else {
finish_rmw(rbio);
@@ -1193,7 +1197,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
struct btrfs_bio *bbio = rbio->bbio;
-   void *pointers[bbio->num_stripes];
+   void *pointers[rbio->real_stripes];
int stripe_len = rbio->stripe_len;
int nr_data = rbio->nr_data;
int stripe;
@@ -1207,11 +1211,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
 
bio_list_init(&bio_list);
 
-   if (bbio->num_stripes - rbio->nr_data == 1) {
-   p_stripe = bbio->num_stripes - 1;
-   } else if (bbio->num_stripes - rbio->nr_data == 2) {
-   p_stripe = bbio->num_stripes - 2;
-   q_stripe = bbio->num_stripes - 1;
+   if (rbio->real_stripes - rbio->nr_data == 1) {
+   p_stripe = rbio->real_stripes - 1;
+   } else if (rbio->real_stripes - rbio->nr_data == 2) {
+   p_stripe = rbio->real_stripes - 2;
+   q_stripe = rbio->real_stripes - 1;
} else {
BUG();
}
@@ -1268,7 +1272,7 @@ static noinline void finish_rmw(struct btrfs_rai

[PATCH v4 10/10] Btrfs, replace: enable dev-replace for raid56

2014-12-02 Thread Miao Xie
From: Zhao Lei 

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 326919b..51133ea 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-   if (btrfs_fs_incompat(fs_info, RAID56)) {
> -   btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-   return -EOPNOTSUPP;
-   }
-
switch (args->start.cont_reading_from_srcdev_mode) {
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3



[PATCH v4 04/10] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-12-02 Thread Miao Xie
This patch implements the RAID5/6 common data repair function; the
implementation is similar to the scrub on the other RAID types such as
RAID1. The difference is that we don't read the data from a mirror; we
use the data repair function of RAID5/6.
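
The repair call scrub then makes for a corrupted RAID5/6 block looks
roughly like this (signature as extended by this patch; sctx, bio, bbio,
raid_map and the length/mirror arguments are assumed to come from the
scrub context):

    ret = raid56_parity_recover(sctx->dev_root, bio, bbio, raid_map,
                                stripe_len, mirror_num, 1);

The trailing 1 is the new hold_bbio flag: scrub keeps ownership of bbio
and raid_map so it can reuse them, instead of letting the rbio free them.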

Signed-off-by: Miao Xie 
---
Changelog v3 -> v4:
- Fix the problem that the scrub's raid bio was cached, which was reported by
  Chris.

Changelog v2 -> v3:
- None.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.
---
 fs/btrfs/raid56.c  |  52 ++
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 235 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index cb31cc6..c954537 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,15 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+/*
+ * bbio and raid_map are managed by the caller, so we shouldn't free
+ * them here. And besides that, all rbios with this flag should not
+ * be cached, because we need raid_map to check whether the rbios'
+ * stripes are the same or not, but it is very likely that the caller
+ * has freed raid_map, so don't cache those rbios.
+ */
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +808,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   __free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +841,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio->stripe_pages[i] = NULL;
}
}
-   kfree(rbio->raid_map);
-   kfree(rbio->bbio);
+
+   free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +958,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(&rbio->bio_list);
INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1714,10 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -1888,7 +1912,8 @@ cleanup:
 cleanup_io:
 
if (rbio->read_rebuild) {
-   if (err == 0)
+   if (err == 0 &&
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags))
cache_rbio_pages(rbio);
else
clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
@@ -2038,15 +2063,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
rbio->read_rebuild = 1;
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2083,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7

[PATCH v4 06/10] Btrfs, raid56: support parity scrub on raid56

2014-12-02 Thread Miao Xie
The implementation is:
- Read and check all the data with checksums in the same stripe.
  All the data which has a checksum is COW data, and we are sure
  that it is not changed even though we don't lock the stripe, because
  the space of that data can only be reclaimed after the current
  transaction is committed, and only then can the fs use it to store
  other data. But when doing scrub, we hold the current transaction,
  so that data cannot be reclaimed; it is safe to read and check
  it outside of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and the parity
  The data without checksum and the parity may be changed if we don't
  lock the stripe, so we need to read them in the stripe lock context.
- Check the parity (a standalone sketch of this step follows below)
- Re-calculate the new parity and write it back if the old parity
  is not right
- Unlock the stripe

If we can not read out the data, or the data we read is corrupted,
we will try to repair it. If the repair fails, we will mark the
horizontal sub-stripe (pages on the same horizontal) as a corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-stripes that have no data, we
introduce a bitmap. If there is some data on a horizontal sub-stripe,
we set the relative bit to 1, and when we check and repair the
parity, we skip those horizontal sub-stripes whose relative
bits are 0.
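
The "check the parity" step, reduced to the RAID5 case as a standalone
sketch (the patch itself goes through the xor/raid6 helpers in raid56.c;
the function below is illustrative only):

    /* return 1 if parity equals the XOR of all nr_data stripes, else 0 */
    static int raid5_parity_ok(void **data, int nr_data,
                               const void *parity, size_t len)
    {
            const u64 *p = parity;
            size_t i;
            int d;

            for (i = 0; i < len / sizeof(u64); i++) {
                    u64 acc = 0;

                    for (d = 0; d < nr_data; d++) {
                            const u64 *s = data[d];

                            acc ^= s[i];
                    }
                    if (acc != p[i])
                            return 0;       /* on-disk parity is stale */
            }
            return 1;
    }

When this check fails, the recalculated parity is written back in place
of the old one.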

Signed-off-by: Miao Xie 
---
Changelog v3 -> v4:
- None.

Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/raid56.c | 514 -
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 609 --
 3 files changed, 1115 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 4924388..58a8408 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -72,6 +72,7 @@
 enum btrfs_rbio_ops {
BTRFS_RBIO_WRITE= 0,
BTRFS_RBIO_READ_REBUILD = 1,
+   BTRFS_RBIO_PARITY_SCRUB = 2,
 };
 
 struct btrfs_raid_bio {
@@ -130,6 +131,7 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int stripe_npages;
/*
 * set if we're doing a parity rebuild
 * for a read from higher up, which is handled
@@ -144,6 +146,7 @@ struct btrfs_raid_bio {
/* second bad stripe (for raid6 use) */
int failb;
 
+   int scrubp;
/*
 * number of pages needed to represent the full
 * stripe
@@ -178,6 +181,11 @@ struct btrfs_raid_bio {
 * here for faster lookup
 */
struct page **bio_pages;
+
+   /*
+* bitmap to record which horizontal stripe has data
+*/
+   unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -192,6 +200,10 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+int need_check);
+static void async_scrub_parity(struct btrfs_raid_bio *rbio);
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -593,10 +605,20 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
cur->raid_map[0])
return 0;
 
-   /* reads can't merge with writes */
-   if (last->operation != cur->operation) {
+   /* we can't merge with different operations */
+   if (last->operation != cur->operation)
+   return 0;
+   /*
+* We need to read the full stripe from the drive, then
+* check and repair the parity and write the new results.
+*
+* We're not allowed to add any new bios to the
+* bio list here, anyone else that wants to
+* change this stripe needs to do their own rmw.
+*/
+   if (last->operation == BTRFS_RBIO_PARITY_SCRUB ||
+   cur->operation == BTRFS_RBIO_PARITY_SCRUB)
return 0;
-   }
 
return 1;
 }
@@ -789,9 +811,12 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
 
if (next->operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else if (next->operation == BTRFS_RBIO_WRITE){
+   else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);

[PATCH v4 08/10] Btrfs, replace: write raid56 parity into the replace target device

2014-12-02 Thread Miao Xie
This function reuses the code of the parity scrub; we just write the
right parity, or the corrected parity, into the target device before the
parity scrub ends.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c | 23 +++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 16fe456..7e6f239 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2318,7 +2318,9 @@ static void raid_write_parity_end_io(struct bio *bio, int 
err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check)
 {
+   struct btrfs_bio *bbio = rbio->bbio;
void *pointers[rbio->real_stripes];
+   DECLARE_BITMAP(pbitmap, rbio->stripe_npages);
int nr_data = rbio->nr_data;
int stripe;
int pagenr;
@@ -2328,6 +2330,7 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
struct page *q_page = NULL;
struct bio_list bio_list;
struct bio *bio;
+   int is_replace = 0;
int ret;
 
bio_list_init(&bio_list);
@@ -2341,6 +2344,11 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
BUG();
}
 
+   if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
+   is_replace = 1;
+   bitmap_copy(pbitmap, rbio->dbitmap, rbio->stripe_npages);
+   }
+
/*
 * Because the higher layers(scrubber) are unlikely to
 * use this area of the disk again soon, so don't cache
@@ -2429,6 +2437,21 @@ writeback:
goto cleanup;
}
 
+   if (!is_replace)
+   goto submit_write;
+
+   for_each_set_bit(pagenr, pbitmap, rbio->stripe_npages) {
+   struct page *page;
+
+   page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+   ret = rbio_add_io_page(rbio, &bio_list, page,
+  bbio->tgtdev_map[rbio->scrubp],
+  pagenr, rbio->stripe_len);
+   if (ret)
+   goto cleanup;
+   }
+
+submit_write:
nr_data = bio_list_size(&bio_list);
if (!nr_data) {
/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e3f0b0f..1d6f16a 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2715,7 +2715,7 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
goto out;
 
length = sparity->logic_end - sparity->logic_start + 1;
-   ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+   ret = btrfs_map_sblock(sctx->dev_root->fs_info, WRITE,
   sparity->logic_start,
   &length, &bbio, 0, &raid_map);
if (ret || !bbio || !raid_map)
-- 
1.9.3



[PATCH v4 01/10] Btrfs: remove unused bbio_ret in __btrfs_map_block in condition

2014-12-02 Thread Miao Xie
From: Zhao Lei 

bbio_ret in this condition is always !NULL because the previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
Reviewed-by: David Sterba 
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 54db1fb..6f80aef 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,8 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
BTRFS_BLOCK_GROUP_RAID6)) {
u64 tmp;
 
-   if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1)
-   && raid_map_ret) {
+   if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
int i, rot;
 
/* push stripe_nr back to the start of the full stripe 
*/
-- 
1.9.3



[PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
This patchset implements the device scrub/replace function for RAID56.
Most of the implementation for the common data is similar to the other
RAID types; the difference, and the difficulty, is the parity processing.
The basic idea is to read and check the data which has a checksum outside
of the raid56 stripe lock. If the data is right, we lock the raid56 stripe
and read out the other data in the same stripe; if no IO error happens, we
calculate the parity and check the original one. If the original parity is
right, the parity scrub passes; otherwise we write out the new one. But if
the common data (not parity) that we read out is wrong, we will try to
recover it, and then check and repair the parity.

And in order to avoid making the code more and more complex, we copied some
code of the common data processing for the parity; the cleanup work is on
my TODO list.

We have done some tests and the patchset worked well. Of course, more tests
are welcome. If you are interested in using or testing it, you can pull the
patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v3 -> v4:
- Fix the problem that the scrub's raid bio was cached, which was reported
  by Chris.
- Remove the 10th patch; the deadlock that was described in that patch
  doesn't exist on the current kernel.
- Rebase the patchset to the top of integration branch

Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Fix possible deadlock caused by the pending bios in the plug list
  when the io submitters were going to sleep.
- Fix an unhandled use-after-free problem of the source device in the final
  device replace procedure.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.

Thanks
Miao

Miao Xie (7):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace
procedure on raid56

Zhao Lei (3):
  Btrfs: remove unused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/ctree.h   |   7 +-
 fs/btrfs/dev-replace.c |   9 +-
 fs/btrfs/raid56.c  | 763 +-
 fs/btrfs/raid56.h  |  16 +-
 fs/btrfs/scrub.c   | 803 +++--
 fs/btrfs/volumes.c |  52 +++-
 fs/btrfs/volumes.h |  14 +-
 7 files changed, 1531 insertions(+), 133 deletions(-)

-- 
1.9.3



[PATCH v4 03/10] Btrfs, raid56: don't change bbio and raid_map

2014-12-02 Thread Miao Xie
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any member of bbio and don't free it at the
end of the IO request. So we introduce similar members in the raid bio,
and no longer access those members through bbio.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 66944b9..cb31cc6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
+
+   atomic_t stripes_pending;
+
+   atomic_t error;
/*
 * these are two arrays of pointers.  We allocate the
 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
 
/* OK, we have read all the stripes we need to. */
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
err = -EIO;
 
rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio->faila = -1;
rbio->failb = -1;
atomic_set(&rbio->refs, 1);
+   atomic_set(&rbio->error, 0);
+   atomic_set(&rbio->stripes_pending, 0);
 
/*
 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
 
/*
 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
}
}
 
-   atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-   BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+   atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+   BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
while (1) {
bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, 
int failed)
if (rbio->faila == -1) {
/* first failure on this rbio */
rbio->faila = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
} else if (rbio->failb == -1) {
/* second failure on this rbio */
rbio->failb = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
} else {
ret = -EIO;
}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
goto cleanup;
 
/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio 
*rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
int bios_to_read = 0;
-   struct btrfs_bio *bbio = rbio->bbio;
struct bio_list bio_list;
int ret;
int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
index_rbio_pages(rbio);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
/*
 * build a list of bios to read all the missing parts of this
 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 * the bbio may be freed once we submit the last bio.  Make sure
 * not to touch it after that
 */
-   atomic_set(&bbio->stripes_pending, bios_to_read);
+   atomic_set(&rbio->stripes_pending, bios_to_read);
while (1) {
bio = bio_list_pop(&bio_list);
if 

Re: [PATCH] Btrfs: fix wrong list access on the failure of reading out checksum

2014-12-01 Thread Miao Xie
Please ignore this patch; Chris has already fixed this problem.

Thanks
Miao

On Mon, 1 Dec 2014 18:04:13 +0800, Miao Xie wrote:
> If we fail to read out the checksums, we free all the checksums already
> in the list. But the current code computes the entry from the list head
> itself, not from the first entry in the list. Fix it.
> 
> Signed-off-by: Miao Xie 
> ---
>  fs/btrfs/file-item.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 783a943..c26b58f 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 
> start, u64 end,
>   ret = 0;
>  fail:
>   while (ret < 0 && !list_empty(&tmplist)) {
> - sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
> + sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
> + list);
>   list_del(&sums->list);
>   kfree(sums);
>   }
> 



[PATCH] Btrfs: fix wrong list access on the failure of reading out checksum

2014-12-01 Thread Miao Xie
If we fail to read out the checksums, we free all the checksums already
in the list. But the current code computes the entry from the list head
itself, not from the first entry in the list. Fix it.
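
The difference matters because list_entry() is just container_of():
applied to &tmplist it casts the list head itself, a local variable that
is not embedded in any btrfs_ordered_sum, into the structure. Side by
side:

    /* wrong: computes container_of(&tmplist, ...), a bogus pointer */
    sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);

    /* right: dereferences tmplist.next, the actual first element */
    sums = list_first_entry(&tmplist, struct btrfs_ordered_sum, list);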

Signed-off-by: Miao Xie 
---
 fs/btrfs/file-item.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 783a943..c26b58f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 
start, u64 end,
ret = 0;
 fail:
while (ret < 0 && !list_empty(&tmplist)) {
-   sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
+   sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
+   list);
list_del(&sums->list);
kfree(sums);
}
-- 
1.9.3



Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Miao Xie
On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
> On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
>> On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie  wrote:
>>> The increase/decrease of bio counter is on the I/O path, so we should
>>> use io_schedule() instead of schedule(), or the deadlock might be
>>> triggered by the pending I/O in the plug list. io_schedule() can help
>>> us because it will flush all the pending I/O before the task is going
>>> to sleep.
>>
>> Can you please describe this deadlock in more detail?  schedule() also
>> triggers a flush of the plug list, and if that's no longer sufficient we
>> can run into other problems (especially with preemption on).
> 
> Sorry, my mistake. I forgot to check the current implementation of
> schedule(), which flushes the plug list unconditionally. Please ignore
> this patch.

I have updated my raid56-scrub-replace branch, please re-pull the branch.

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

> 
> Thanks
> Miao
> 
>>
>> -chris
>>
>>
> 
> 



Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Miao Xie
On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
> On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie  wrote:
>> The increase/decrease of the bio counter is on the I/O path, so we should
>> use io_schedule() instead of schedule(), or a deadlock might be
>> triggered by the pending I/O in the plug list. io_schedule() can help
>> us because it will flush all the pending I/O before the task goes
>> to sleep.
> 
> Can you please describe this deadlock in more detail?  schedule() also 
> triggers
> a flush of the plug list, and if that's no longer sufficient we can run into 
> other
> problems (especially with preemption on).

Sorry for my mistake. I forgot to check the current implementation of schedule(), 
which flushes the plug list unconditionally. Please ignore this patch.
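
For reference, this is roughly what schedule() has been doing on its way
to sleep since the plug flush was moved into the scheduler (a paraphrased
sketch of kernel/sched/core.c from that era, with the guards simplified;
not a verbatim copy):

static inline void sched_submit_work(struct task_struct *tsk)
{
        if (!tsk->state)        /* still runnable, nothing to flush */
                return;
        /*
         * If we are going to sleep and we have plugged IO queued,
         * make sure to submit it to avoid deadlocks.
         */
        if (blk_needs_flush_plug(tsk))
                blk_schedule_flush_plug(tsk);
}

asmlinkage __visible void __sched schedule(void)
{
        sched_submit_work(current);
        __schedule();
}

So the plugged bios are submitted no matter whether we sleep via
schedule() or io_schedule().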

Thanks
Miao

> 
> -chris
> 
> 



[PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Miao Xie
The increase/decrease of the bio counter is on the I/O path, so we should
use io_schedule() instead of schedule(), or a deadlock might be
triggered by the pending I/O in the plug list. io_schedule() can help
us because it will flush all the pending I/O before the task goes
to sleep.

Signed-off-by: Miao Xie 
---
Changelog v2 -> v3:
- New patch to fix possible deadlock caused by the pending bios in the
  plug list when the io submitters were going to sleep.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/dev-replace.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fa27b4e..894796a 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,16 +928,23 @@ void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, 
s64 amount)
wake_up(&fs_info->replace_wait);
 }
 
+#define btrfs_wait_event_io(wq, condition) \
+do {   \
+   if (condition)  \
+   break;  \
+   (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,  \
+   io_schedule()); \
+} while (0)
+
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 {
-   DEFINE_WAIT(wait);
 again:
percpu_counter_inc(&fs_info->bio_counter);
if (test_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state)) {
btrfs_bio_counter_dec(fs_info);
-   wait_event(fs_info->replace_wait,
-  !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
-&fs_info->fs_state));
+   btrfs_wait_event_io(fs_info->replace_wait,
+   !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
+ &fs_info->fs_state));
goto again;
}
 
-- 
1.9.3



[PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device

2014-11-26 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, all the other operations don't
  take the target device stripes into account.
- When we do a write operation, we read the data from the common devices
  and calculate the parity. Then we write the dirty data and the new parity
  out; at this time, we find the corresponding replace target stripes
  and write the corresponding data into them.

Note: The function that copies the old data on the source device to
the target device was implemented in the past; it is similar to the
other RAID types.
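
Schematically, the stripe array in the btrfs bio then looks like this (a
sketch of the layout described above, not text from the patch):

  bbio->stripes[]:  | D0 | D1 | ... | P | (Q) | T0 | T1 | ...
                    |<----- real stripes ---->|<- num_tgtdevs ->|

  real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
  nr_data      = real_stripes - 1;    /* RAID5 */
  nr_data      = real_stripes - 2;    /* RAID6 */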

Signed-off-by: Miao Xie 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3b99cbc..6f82c1b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -124,6 +124,8 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int real_stripes;
+
int stripe_npages;
/*
 * set if we're doing a parity rebuild
@@ -631,7 +633,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio 
*rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-   if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+   if (rbio->nr_data + 1 == rbio->real_stripes)
return NULL;
 
index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
@@ -974,7 +976,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 {
struct btrfs_raid_bio *rbio;
int nr_data = 0;
-   int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+   int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+   int num_pages = rbio_nr_pages(stripe_len, real_stripes);
int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
@@ -994,6 +997,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio->fs_info = root->fs_info;
rbio->stripe_len = stripe_len;
rbio->nr_pages = num_pages;
+   rbio->real_stripes = real_stripes;
rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
@@ -1010,10 +1014,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio->bio_pages = p + sizeof(struct page *) * num_pages;
rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-   if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-   nr_data = bbio->num_stripes - 2;
+   if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+   nr_data = real_stripes - 2;
else
-   nr_data = bbio->num_stripes - 1;
+   nr_data = real_stripes - 1;
 
rbio->nr_data = nr_data;
return rbio;
@@ -1125,7 +1129,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
if (rbio->faila >= 0 || rbio->failb >= 0) {
-   BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+   BUG_ON(rbio->faila == rbio->real_stripes - 1);
__raid56_parity_recover(rbio);
} else {
finish_rmw(rbio);
@@ -1186,7 +1190,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
struct btrfs_bio *bbio = rbio->bbio;
-   void *pointers[bbio->num_stripes];
+   void *pointers[rbio->real_stripes];
int stripe_len = rbio->stripe_len;
int nr_data = rbio->nr_data;
int stripe;
@@ -1200,11 +1204,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
 
bio_list_init(&bio_list);
 
-   if (bbio->num_stripes - rbio->nr_data == 1) {
-   p_stripe = bbio->num_stripes - 1;
-   } else if (bbio->num_stripes - rbio->nr_data == 2) {
-   p_stripe = bbio->num_stripes - 2;
-   q_stripe = bbio->num_stripes - 1;
+   if (rbio->real_stripes - rbio->nr_data == 1) {
+   p_stripe = rbio->real_stripes - 1;
+   } else if (rbio->real_stripes - rbio->nr_data == 2) {
+   p_stripe = rbio->real_stripes - 2;
+   q_stripe = rbio->real_stripes - 1;
} else {
BUG();
}
@@ -1261,7 +1265,7 @@ static noinline void finish_rmw(struct btrfs_rai

[PATCH v3 09/11] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56

2014-11-26 Thread Miao Xie
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of a device
replace, but at that time btrfs didn't support device replace
on raid56, so the problem was not fixed for the raid56 profile.
Now that we have implemented device replace for raid56, we need to
kick that problem out before we enable the function for raid56.

The fix is very simple: we just increase the per-cpu bio
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.
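
The shape of the fix, sketched in user space with a plain mutex/condvar
standing in for the per-cpu counter and the replace wait queue
(illustrative only, not the patch code):

#include <pthread.h>
#include <stdio.h>

static long bio_counter;        /* stands in for fs_info->bio_counter */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t replace_wait = PTHREAD_COND_INITIALIZER;

static void bio_counter_inc(void)       /* before submitting a raid56 io */
{
        pthread_mutex_lock(&lock);
        bio_counter++;
        pthread_mutex_unlock(&lock);
}

static void bio_counter_dec(void)       /* when the raid56 io ends */
{
        pthread_mutex_lock(&lock);
        if (--bio_counter == 0)
                pthread_cond_broadcast(&replace_wait);
        pthread_mutex_unlock(&lock);
}

static void replace_finish(void)        /* the finishing procedure */
{
        pthread_mutex_lock(&lock);
        while (bio_counter > 0)         /* drain all in-flight raid56 io */
                pthread_cond_wait(&replace_wait, &lock);
        pthread_mutex_unlock(&lock);
        printf("no raid56 io in flight, safe to remove the source device\n");
}

int main(void)
{
        bio_counter_inc();
        bio_counter_dec();
        replace_finish();
        return 0;
}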

Signed-off-by: Miao Xie 
---
Changelog v2 -> v3:
- New patch to fix undealt use-after-free problem of the source device
  in the final device replace procedure.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/ctree.h   |  7 ++-
 fs/btrfs/dev-replace.c |  4 ++--
 fs/btrfs/raid56.c  | 41 -
 fs/btrfs/raid56.h  |  4 ++--
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/volumes.c |  7 ++-
 6 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fe69edd..470e317 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4097,7 +4097,12 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 
devid,
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
+
+static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+{
+   btrfs_bio_counter_sub(fs_info, 1);
+}
 
 /* reada.c */
 struct reada_control {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6f662b3..fa27b4e 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -920,9 +920,9 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info 
*fs_info)
percpu_counter_inc(&fs_info->bio_counter);
 }
 
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
-   percpu_counter_dec(&fs_info->bio_counter);
+   percpu_counter_sub(&fs_info->bio_counter, amount);
 
if (waitqueue_active(&fs_info->replace_wait))
wake_up(&fs_info->replace_wait);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index cfa449f..4bdb822 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -155,6 +155,8 @@ struct btrfs_raid_bio {
 */
int bio_list_bytes;
 
+   int generic_bio_cnt;
+
atomic_t refs;
 
atomic_t stripes_pending;
@@ -347,6 +349,7 @@ static void merge_rbio(struct btrfs_raid_bio *dest,
 {
bio_list_merge(&dest->bio_list, &victim->bio_list);
dest->bio_list_bytes += victim->bio_list_bytes;
+   dest->generic_bio_cnt += victim->generic_bio_cnt;
bio_list_init(&victim->bio_list);
 }
 
@@ -884,6 +887,10 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, 
int err, int uptodate)
 {
struct bio *cur = bio_list_get(&rbio->bio_list);
struct bio *next;
+
+   if (rbio->generic_bio_cnt)
+   btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
+
free_raid_bio(rbio);
 
while (cur) {
@@ -1768,6 +1775,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct btrfs_raid_bio *rbio;
struct btrfs_plug_cb *plug = NULL;
struct blk_plug_cb *cb;
+   int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
if (IS_ERR(rbio)) {
@@ -1778,12 +1786,19 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
rbio->bio_list_bytes = bio->bi_iter.bi_size;
rbio->operation = BTRFS_RBIO_WRITE;
 
+   btrfs_bio_counter_inc_noblocked(root->fs_info);
+   rbio->generic_bio_cnt = 1;
+
/*
 * don't plug on full rbios, just get them out the door
 * as quickly as we can
 */
-   if (rbio_is_full(rbio))
-   return full_stripe_write(rbio);
+   if (rbio_is_full(rbio)) {
+   ret = full_stripe_write(rbio);
+   if (ret)
+   btrfs_bio_counter_dec(root->fs_info);
+   return ret;
+   }
 
cb = blk_check_plugged(btrfs_raid_unplug, root->fs_info,
   sizeof(*plug));
@@ -1794,10 +1809,13 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
INIT_LIST_HEAD(&plug->rbio_list);
}
list_add_tail(&rbio->plug_list, &plug->rbio_list);
+   ret = 0;
} else {
-   return __raid56_parity_write(rbio);
+   ret = __raid56_parity

[PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56

2014-11-26 Thread Miao Xie
From: Zhao Lei 

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 894796a..9f6a464 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-   if (btrfs_fs_incompat(fs_info, RAID56)) {
-   btrfs_warn(fs_info, "dev_replace cannot yet handle 
RAID5/RAID6");
-   return -EOPNOTSUPP;
-   }
-
switch (args->start.cont_reading_from_srcdev_mode) {
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3



[PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56

2014-11-26 Thread Miao Xie
The implementation is:
- Read and check all the data with checksums in the same stripe.
  All the data which has a checksum is COW data, and we are sure
  that it is not changed even though we don't lock the stripe, because
  the space of that data can only be reclaimed after the current
  transaction is committed, and only then can the fs use it to store
  other data; but while doing scrub we hold the current transaction,
  so that data can not be reclaimed, and it is safe to read and check
  it outside of the stripe lock.
- Lock the stripe
- Read out all the data without checksums, and the parity
  The data without checksums and the parity may be changed if we don't
  lock the stripe, so we need to read them in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write it back if the old parity
  is not right (a toy illustration follows below)
- Unlock the stripe

If we can not read out the data, or the data we read is corrupted,
we will try to repair it. If the repair fails, we will mark the
horizontal sub-stripe (the pages at the same horizontal position) as a
corrupted sub-stripe, and we will skip the parity check and repair of
that horizontal sub-stripe.

And in order to skip the horizontal sub-stripes that have no data, we
introduce a bitmap. If there is some data on a horizontal sub-stripe,
we set the corresponding bit to 1, and when we check and repair the
parity, we skip those horizontal sub-stripes whose corresponding
bits are 0.
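
As a toy illustration of the parity check/repair step above, on a RAID5
stripe held in memory (stand-alone user-space code, not from the patch):

#include <stdio.h>
#include <string.h>

#define NR_DATA   3
#define STRIPE_SZ 8

int main(void)
{
        unsigned char data[NR_DATA][STRIPE_SZ] = {
                "abcdefg", "hijklmn", "opqrstu",
        };
        unsigned char parity[STRIPE_SZ];
        unsigned char calc[STRIPE_SZ];
        int i, j;

        /* build the correct parity, then pretend it got corrupted */
        memset(parity, 0, sizeof(parity));
        for (i = 0; i < NR_DATA; i++)
                for (j = 0; j < STRIPE_SZ; j++)
                        parity[j] ^= data[i][j];
        parity[0] ^= 0xff;

        /* re-calculate the parity from the checksum-verified data */
        memset(calc, 0, sizeof(calc));
        for (i = 0; i < NR_DATA; i++)
                for (j = 0; j < STRIPE_SZ; j++)
                        calc[j] ^= data[i][j];

        if (memcmp(parity, calc, STRIPE_SZ)) {
                printf("parity mismatch, writing the new parity back\n");
                memcpy(parity, calc, STRIPE_SZ);
        } else {
                printf("parity is right, nothing to write\n");
        }
        return 0;
}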

Signed-off-by: Miao Xie 
---
Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/raid56.c | 514 -
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 609 --
 3 files changed, 1115 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index bfc406d..3b99cbc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -65,6 +65,7 @@
 enum btrfs_rbio_ops {
BTRFS_RBIO_WRITE= 0,
BTRFS_RBIO_READ_REBUILD = 1,
+   BTRFS_RBIO_PARITY_SCRUB = 2,
 };
 
 struct btrfs_raid_bio {
@@ -123,6 +124,7 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int stripe_npages;
/*
 * set if we're doing a parity rebuild
 * for a read from higher up, which is handled
@@ -137,6 +139,7 @@ struct btrfs_raid_bio {
/* second bad stripe (for raid6 use) */
int failb;
 
+   int scrubp;
/*
 * number of pages needed to represent the full
 * stripe
@@ -171,6 +174,11 @@ struct btrfs_raid_bio {
 * here for faster lookup
 */
struct page **bio_pages;
+
+   /*
+* bitmap to record which horizontal stripe has data
+*/
+   unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -185,6 +193,10 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+int need_check);
+static void async_scrub_parity(struct btrfs_raid_bio *rbio);
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -586,10 +598,20 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
cur->raid_map[0])
return 0;
 
-   /* reads can't merge with writes */
-   if (last->operation != cur->operation) {
+   /* we can't merge with different operations */
+   if (last->operation != cur->operation)
+   return 0;
+   /*
+* We've need read the full stripe from the drive.
+* check and repair the parity and write the new results.
+*
+* We're not allowed to add any new bios to the
+* bio list here, anyone else that wants to
+* change this stripe needs to do their own rmw.
+*/
+   if (last->operation == BTRFS_RBIO_PARITY_SCRUB ||
+   cur->operation == BTRFS_RBIO_PARITY_SCRUB)
return 0;
-   }
 
return 1;
 }
@@ -782,9 +804,12 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
 
if (next->operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else if (next->operation == BTRFS_RBIO_WRITE){
+   else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
async_rmw_stripe(next);
+ 

[PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-11-26 Thread Miao Xie
From: Zhao Lei 

stripe_index's value is set again in a later line:
stripe_index = 0;

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
Reviewed-by: David Sterba 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
 
/* push stripe_nr back to the start of the full stripe 
*/
stripe_nr = raid56_full_stripe_start;
-   do_div(stripe_nr, stripe_len);
-
-   stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+   do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map->num_stripes;
-- 
1.9.3



[PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map

2014-11-26 Thread Miao Xie
Because we will reuse the bbio and raid_map during the scrub later, it is
better that we don't change any member of the bbio and don't free it at
the end of the IO request. So we introduce similar members in the raid
bio, and don't access those members of the bbio any more.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
+
+   atomic_t stripes_pending;
+
+   atomic_t error;
/*
 * these are two arrays of pointers.  We allocate the
 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
 
/* OK, we have read all the stripes we need to. */
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
err = -EIO;
 
rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio->faila = -1;
rbio->failb = -1;
atomic_set(&rbio->refs, 1);
+   atomic_set(&rbio->error, 0);
+   atomic_set(&rbio->stripes_pending, 0);
 
/*
 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
 
/*
 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
}
}
 
-   atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-   BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+   atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+   BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
while (1) {
bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, 
int failed)
if (rbio->faila == -1) {
/* first failure on this rbio */
rbio->faila = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
} else if (rbio->failb == -1) {
/* second failure on this rbio */
rbio->failb = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
} else {
ret = -EIO;
}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
goto cleanup;
 
/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio 
*rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
int bios_to_read = 0;
-   struct btrfs_bio *bbio = rbio->bbio;
struct bio_list bio_list;
int ret;
int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
index_rbio_pages(rbio);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
/*
 * build a list of bios to read all the missing parts of this
 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 * the bbio may be freed once we submit the last bio.  Make sure
 * not to touch it after that
 */
-   atomic_set(&bbio->stripes_pending, bios_to_read);
+   atomic_set(&rbio->stripes_pending, bios_to_read);
while (1) {
bio = bio_list_pop(&bio_list);
if 

[PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-11-26 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to the scrub on the other RAID profiles such
as RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c  |  42 +---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..6013d88 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   __free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio->stripe_pages[i] = NULL;
}
}
-   kfree(rbio->raid_map);
-   kfree(rbio->bbio);
+
+   free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(&rbio->bio_list);
INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
rbio->read_rebuild = 1;
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 struct btrfs_bio *bbio, u64 *raid_map,
-u64 stripe_len, int mirror_num);
+u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
   struct btrfs_bio *bbio, u64 *raid_map,
   u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
+struct scrub_recover {
+   atomic_trefs;
+   struct btrfs_bio*bbio;
+   u64 *raid_map;
+   u64 

[PATCH v3 08/11] Btrfs, replace: write raid56 parity into the replace target device

2014-11-26 Thread Miao Xie
This function reuses the code of the parity scrub; we just write the
right parity, or the corrected parity, into the target device before
the parity scrub ends.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c | 23 +++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6f82c1b..cfa449f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2311,7 +2311,9 @@ static void raid_write_parity_end_io(struct bio *bio, int 
err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check)
 {
+   struct btrfs_bio *bbio = rbio->bbio;
void *pointers[rbio->real_stripes];
+   DECLARE_BITMAP(pbitmap, rbio->stripe_npages);
int nr_data = rbio->nr_data;
int stripe;
int pagenr;
@@ -2321,6 +2323,7 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
struct page *q_page = NULL;
struct bio_list bio_list;
struct bio *bio;
+   int is_replace = 0;
int ret;
 
bio_list_init(&bio_list);
@@ -2334,6 +2337,11 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
BUG();
}
 
+   if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
+   is_replace = 1;
+   bitmap_copy(pbitmap, rbio->dbitmap, rbio->stripe_npages);
+   }
+
/*
 * Because the higher layers(scrubber) are unlikely to
 * use this area of the disk again soon, so don't cache
@@ -2422,6 +2430,21 @@ writeback:
goto cleanup;
}
 
+   if (!is_replace)
+   goto submit_write;
+
+   for_each_set_bit(pagenr, pbitmap, rbio->stripe_npages) {
+   struct page *page;
+
+   page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+   ret = rbio_add_io_page(rbio, &bio_list, page,
+  bbio->tgtdev_map[rbio->scrubp],
+  pagenr, rbio->stripe_len);
+   if (ret)
+   goto cleanup;
+   }
+
+submit_write:
nr_data = bio_list_size(&bio_list);
if (!nr_data) {
/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 7f95afc..0ae837f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2714,7 +2714,7 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
goto out;
 
length = sparity->logic_end - sparity->logic_start + 1;
-   ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+   ret = btrfs_map_sblock(sctx->dev_root->fs_info, WRITE,
   sparity->logic_start,
   &length, &bbio, 0, &raid_map);
if (ret || !bbio || !raid_map)
-- 
1.9.3



[PATCH v3 01/11] Btrfs: remove unused bbio_ret in __btrfs_map_block in condition

2014-11-26 Thread Miao Xie
From: Zhao Lei 

bbio_ret in this condition is always non-NULL because the previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
Reviewed-by: David Sterba 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
BTRFS_BLOCK_GROUP_RAID6)) {
u64 tmp;
 
-   if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1)
-   && raid_map_ret) {
+   if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
int i, rot;
 
/* push stripe_nr back to the start of the full stripe 
*/
-- 
1.9.3



[PATCH v3 00/11] Implement device scrub/replace for RAID56

2014-11-26 Thread Miao Xie
This patchset implements the device scrub/replace function for RAID56.
Most of the implementation for the common data is similar to the other
RAID types; the difference, and the difficulty, is the parity processing.
The basic idea is to read and check the data which has checksums outside
of the raid56 stripe lock; if that data is right, we then lock the raid56
stripe, read out the other data in the same stripe, and, if no IO error
happens, calculate the parity and check the original one. If the original
parity is right, the parity scrub passes; otherwise we write out the new
one. But if the common data (not the parity) that we read out is wrong,
we will try to recover it, and then check and repair the parity.

And in order to avoid making the code more and more complex, we copied
some of the common data processing code for the parity; the cleanup work
is on my TODO list.

We have done some tests, and the patchset worked well. Of course, more
tests are welcome. If you are interested in using or testing it, you can
pull the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Fix possible deadlock caused by the pending bios in the plug list
  when the io submitters were going to sleep.
- Fix undealt use-after-free problem of the source device in the final
  device replace procedure.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.

Thanks
Miao

Miao Xie (8):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace
procedure on raid56
  Btrfs: fix possible deadlock caused by pending I/O in plug list

Zhao Lei (3):
  Btrfs: remove unused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |  20 +-
 fs/btrfs/raid56.c  | 746 -
 fs/btrfs/raid56.h  |  16 +-
 fs/btrfs/scrub.c   | 803 +++--
 fs/btrfs/volumes.c |  52 +++-
 fs/btrfs/volumes.h |  14 +-
 6 files changed, 1521 insertions(+), 130 deletions(-)

-- 
1.9.3



[PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type

2014-11-26 Thread Miao Xie
We will introduce a new operation type later. If we keep using an integer
variable as a boolean to record the operation type, we would have to add
a new variable and increase the size of the raid bio structure, which is
not good. With this patch, we define a different number for each
operation, so we can use just one variable to record the operation type.
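
The idea in miniature (a toy sketch, not kernel code):

#include <stdio.h>

enum rbio_ops { RBIO_WRITE, RBIO_READ_REBUILD, RBIO_PARITY_SCRUB };

struct rbio_flags {     /* grows by one int per new operation type */
        int read_rebuild;
        int parity_scrub;
};

struct rbio_op {        /* a single field covers every type */
        enum rbio_ops operation;
};

int main(void)
{
        printf("flag fields: %zu bytes, enum field: %zu bytes\n",
               sizeof(struct rbio_flags), sizeof(struct rbio_op));
        return 0;
}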

Signed-off-by: Miao Xie 
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6013d88..bfc406d 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -62,6 +62,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+   BTRFS_RBIO_WRITE= 0,
+   BTRFS_RBIO_READ_REBUILD = 1,
+};
+
 struct btrfs_raid_bio {
struct btrfs_fs_info *fs_info;
struct btrfs_bio *bbio;
@@ -124,7 +129,7 @@ struct btrfs_raid_bio {
 * differently from a parity rebuild as part of
 * rmw
 */
-   int read_rebuild;
+   enum btrfs_rbio_ops operation;
 
/* first bad stripe */
int faila;
@@ -147,7 +152,6 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
-
atomic_t stripes_pending;
 
atomic_t error;
@@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
return 0;
 
/* reads can't merge with writes */
-   if (last->read_rebuild !=
-   cur->read_rebuild) {
+   if (last->operation != cur->operation) {
return 0;
}
 
@@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
spin_unlock(&rbio->bio_list_lock);
spin_unlock_irqrestore(&h->lock, flags);
 
-   if (next->read_rebuild)
+   if (next->operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else {
+   else if (next->operation == BTRFS_RBIO_WRITE){
steal_rbio(rbio, next);
async_rmw_stripe(next);
}
@@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
}
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
+   rbio->operation = BTRFS_RBIO_WRITE;
 
/*
 * don't plug on full rbios, just get them out the door
@@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
faila = rbio->faila;
failb = rbio->failb;
 
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
spin_lock_irq(&rbio->bio_list_lock);
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
@@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1871,7 +1875,7 @@ pstripe:
 * know they can be trusted.  If this was a read reconstruction,
 * other endio functions will fiddle the uptodate bits
 */
-   if (!rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_WRITE) {
for (i = 0;  i < nr_pages; i++) {
if (faila != -1) {
page = rbio_stripe_page(rbio, faila, i);
@@ -1888,7 +1892,7 @@ pstripe:
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1904,7 +1908,7 @@ cleanup:
 
 cleanup_io:
 
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
if (err == 0)
cache_rbio_pages(rbio);
else
@@ -2042,7 +2046,7 @@ out:
return 0;
 
 cleanup:
-   if (rbio->read_rebuild)
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD)

[PATCH v2 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-11-24 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to the scrub on the other RAID profiles such
as RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v2:
- Remove the redundant prefix underscores of the function names to make
  them obey the common pattern of the source in raid56.c
---
 fs/btrfs/raid56.c  |  42 +---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..6013d88 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   __free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio->stripe_pages[i] = NULL;
}
}
-   kfree(rbio->raid_map);
-   kfree(rbio->bbio);
+
+   free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(&rbio->bio_list);
INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
rbio->read_rebuild = 1;
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 struct btrfs_bio *bbio, u64 *raid_map,
-u64 stripe_len, int mirror_num);
+u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
   struct btrfs_bio *bbio, u64 *raid_map,
   u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
+struct scrub_recover {
+ 

Re: [PATCH] Btrfs: make sure we wait on logged extents when fsycning two subvols

2014-11-20 Thread Miao Xie
On Thu, 6 Nov 2014 10:19:54 -0500, Josef Bacik wrote:
> If we have two fsync()'s race on different subvols one will do all of its work
> to get into the log_tree, wait on its outstanding IO, and then allow the
> log_tree to finish its commit.  The problem is we were just freeing that
> subvols logged extents instead of waiting on them, so whoever lost the race
> wouldn't really have their data on disk.  Fix this by waiting properly instead
> of freeing the logged extents.  Thanks,
> 
> cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/tree-log.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 2d0fa43..70f99b1 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2600,9 +2600,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>   if (atomic_read(&log_root_tree->log_commit[index2])) {
>   blk_finish_plug(&plug);
>   btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> + btrfs_wait_logged_extents(log, log_transid);

Why not add this log root into a list on the log root tree, and then have
the committer wait for all the ordered extents in each log root that is
added to that list? That way, we can let the committer do some work while
the data of the ordered extents is being transferred to the disk.

Thanks
Miao

>   wait_log_commit(trans, log_root_tree,
>   root_log_ctx.log_transid);
> - btrfs_free_logged_extents(log, log_transid);
>   mutex_unlock(&log_root_tree->log_mutex);
>   ret = root_log_ctx.log_ret;
>   goto out;
> 



[PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-11-14 Thread Miao Xie
From: Zhao Lei 

stripe_index's value is set again in a later line:
stripe_index = 0;

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
 
/* push stripe_nr back to the start of the full stripe 
*/
stripe_nr = raid56_full_stripe_start;
-   do_div(stripe_nr, stripe_len);
-
-   stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+   do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map->num_stripes;
-- 
1.9.3



[PATCH 8/9] Btrfs, replace: write raid56 parity into the replace target device

2014-11-14 Thread Miao Xie
This function reuses the code of the parity scrub; we just write the
right parity, or the corrected parity, into the target device before
the parity scrub ends.

Signed-off-by: Miao Xie 
---
 fs/btrfs/raid56.c | 23 +++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7ad9546a..b69c01f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2305,7 +2305,9 @@ static void raid_write_parity_end_io(struct bio *bio, int 
err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check)
 {
+   struct btrfs_bio *bbio = rbio->bbio;
void *pointers[rbio->real_stripes];
+   DECLARE_BITMAP(pbitmap, rbio->stripe_npages);
int nr_data = rbio->nr_data;
int stripe;
int pagenr;
@@ -2315,6 +2317,7 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
struct page *q_page = NULL;
struct bio_list bio_list;
struct bio *bio;
+   int is_replace = 0;
int ret;
 
bio_list_init(&bio_list);
@@ -2328,6 +2331,11 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
BUG();
}
 
+   if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
+   is_replace = 1;
+   bitmap_copy(pbitmap, rbio->dbitmap, rbio->stripe_npages);
+   }
+
/*
 * Because the higher layers(scrubber) are unlikely to
 * use this area of the disk again soon, so don't cache
@@ -2416,6 +2424,21 @@ writeback:
goto cleanup;
}
 
+   if (!is_replace)
+   goto submit_write;
+
+   for_each_set_bit(pagenr, pbitmap, rbio->stripe_npages) {
+   struct page *page;
+
+   page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+   ret = rbio_add_io_page(rbio, &bio_list, page,
+  bbio->tgtdev_map[rbio->scrubp],
+  pagenr, rbio->stripe_len);
+   if (ret)
+   goto cleanup;
+   }
+
+submit_write:
nr_data = bio_list_size(&bio_list);
if (!nr_data) {
/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3ef1e1b..f690c8f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2710,7 +2710,7 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
goto out;
 
length = sparity->logic_end - sparity->logic_start + 1;
-   ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+   ret = btrfs_map_sblock(sctx->dev_root->fs_info, WRITE,
   sparity->logic_start,
   &length, &bbio, 0, &raid_map);
if (ret || !bbio || !raid_map)
-- 
1.9.3



[PATCH 7/9] Btrfs, replace: write dirty pages into the replace target device

2014-11-14 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, all the other operations don't
  take the target device stripes into account.
- When we do a write operation, we read the data from the common devices
  and calculate the parity. Then we write the dirty data and the new parity
  out; at this time, we find the corresponding replace target stripes
  and write the corresponding data into them.

Note: The function that copies the old data on the source device to
the target device was implemented in the past; it is similar to the
other RAID types.

Signed-off-by: Miao Xie 
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index a13eb1b..7ad9546a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -124,6 +124,8 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int real_stripes;
+
int stripe_npages;
/*
 * set if we're doing a parity rebuild
@@ -619,7 +621,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio 
*rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-   if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+   if (rbio->nr_data + 1 == rbio->real_stripes)
return NULL;
 
index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
@@ -959,7 +961,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 {
struct btrfs_raid_bio *rbio;
int nr_data = 0;
-   int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+   int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+   int num_pages = rbio_nr_pages(stripe_len, real_stripes);
int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
@@ -979,6 +982,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio->fs_info = root->fs_info;
rbio->stripe_len = stripe_len;
rbio->nr_pages = num_pages;
+   rbio->real_stripes = real_stripes;
rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
@@ -995,10 +999,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio->bio_pages = p + sizeof(struct page *) * num_pages;
rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-   if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-   nr_data = bbio->num_stripes - 2;
+   if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+   nr_data = real_stripes - 2;
else
-   nr_data = bbio->num_stripes - 1;
+   nr_data = real_stripes - 1;
 
rbio->nr_data = nr_data;
return rbio;
@@ -1110,7 +1114,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
if (rbio->faila >= 0 || rbio->failb >= 0) {
-   BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+   BUG_ON(rbio->faila == rbio->real_stripes - 1);
__raid56_parity_recover(rbio);
} else {
finish_rmw(rbio);
@@ -1171,7 +1175,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
struct btrfs_bio *bbio = rbio->bbio;
-   void *pointers[bbio->num_stripes];
+   void *pointers[rbio->real_stripes];
int stripe_len = rbio->stripe_len;
int nr_data = rbio->nr_data;
int stripe;
@@ -1185,11 +1189,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
 
bio_list_init(&bio_list);
 
-   if (bbio->num_stripes - rbio->nr_data == 1) {
-   p_stripe = bbio->num_stripes - 1;
-   } else if (bbio->num_stripes - rbio->nr_data == 2) {
-   p_stripe = bbio->num_stripes - 2;
-   q_stripe = bbio->num_stripes - 1;
+   if (rbio->real_stripes - rbio->nr_data == 1) {
+   p_stripe = rbio->real_stripes - 1;
+   } else if (rbio->real_stripes - rbio->nr_data == 2) {
+   p_stripe = rbio->real_stripes - 2;
+   q_stripe = rbio->real_stripes - 1;
} else {
BUG();
}
@@ -1246,7 +1250,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)

[PATCH 3/9] Btrfs, raid56: don't change bbio and raid_map

2014-11-14 Thread Miao Xie
Because we will reuse the bbio and raid_map during the scrub later, it is
better that we don't change any member of the bbio and don't free it at
the end of the IO request. So we introduce similar members in the raid
bio, and don't access those members of the bbio any more.

Signed-off-by: Miao Xie 
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
+
+   atomic_t stripes_pending;
+
+   atomic_t error;
/*
 * these are two arrays of pointers.  We allocate the
 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
 
/* OK, we have read all the stripes we need to. */
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
err = -EIO;
 
rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio->faila = -1;
rbio->failb = -1;
atomic_set(&rbio->refs, 1);
+   atomic_set(&rbio->error, 0);
+   atomic_set(&rbio->stripes_pending, 0);
 
/*
 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
 
/*
 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
}
}
 
-   atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-   BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+   atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+   BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
while (1) {
bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, 
int failed)
if (rbio->faila == -1) {
/* first failure on this rbio */
rbio->faila = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
} else if (rbio->failb == -1) {
/* second failure on this rbio */
rbio->failb = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
} else {
ret = -EIO;
}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
goto cleanup;
 
/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio 
*rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
int bios_to_read = 0;
-   struct btrfs_bio *bbio = rbio->bbio;
struct bio_list bio_list;
int ret;
int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
index_rbio_pages(rbio);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
/*
 * build a list of bios to read all the missing parts of this
 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 * the bbio may be freed once we submit the last bio.  Make sure
 * not to touch it after that
 */
-   atomic_set(&bbio->stripes_pending, bios_to_read);
+   atomic_set(&rbio->stripes_pending, bios_to_read);
while (1) {
bio = bio_list_pop(&bio_list);
if (!bio)
@@ -1917,10 +1921,10 @@ static void 

[PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-11-14 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to the scrub on the other RAID profiles such
as RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie 
---
 fs/btrfs/raid56.c  |  42 +---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..b3e9c76 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+___free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void __free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   ___free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio->stripe_pages[i] = NULL;
}
}
-   kfree(rbio->raid_map);
-   kfree(rbio->bbio);
+
+   __free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(&rbio->bio_list);
INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   ___free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   ___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
rbio->read_rebuild = 1;
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   ___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 struct btrfs_bio *bbio, u64 *raid_map,
-u64 stripe_len, int mirror_num);
+u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
   struct btrfs_bio *bbio, u64 *raid_map,
   u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
+struct scrub_recover {
+   atomic_trefs;
+   struct btrfs_bio*bbio;
+   u64 *raid_map;
+   u64 map_length;
+};

[PATCH 5/9] Btrfs,raid56: use a variant to record the operation type

2014-11-14 Thread Miao Xie
We will introduce a new operation type later. If we keep using an integer
variable as a bool to record the operation type, we would have to add a new
variable for each new type and increase the size of the raid bio structure,
which is not good. With this patch, we define a distinct number for each
operation, so a single variable can record the operation type.
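
To make the size argument concrete, a small user-space illustration (not
btrfs code; exact sizes are platform-dependent):

#include <stdio.h>

enum rbio_ops { RBIO_WRITE, RBIO_READ_REBUILD, RBIO_PARITY_SCRUB };

struct rbio_with_flags {   /* one int flag per operation type ... */
	int read_rebuild;
	int parity_scrub;  /* ... so each new type grows the struct */
};

struct rbio_with_op {      /* a single field encodes every type */
	enum rbio_ops operation;
};

int main(void)
{
	printf("flags: %zu bytes, enum: %zu bytes\n",
	       sizeof(struct rbio_with_flags), sizeof(struct rbio_with_op));
	return 0;
}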

Signed-off-by: Miao Xie 
---
 fs/btrfs/raid56.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b3e9c76..d550e9b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -62,6 +62,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+   BTRFS_RBIO_WRITE= 0,
+   BTRFS_RBIO_READ_REBUILD = 1,
+};
+
 struct btrfs_raid_bio {
struct btrfs_fs_info *fs_info;
struct btrfs_bio *bbio;
@@ -124,7 +129,7 @@ struct btrfs_raid_bio {
 * differently from a parity rebuild as part of
 * rmw
 */
-   int read_rebuild;
+   enum btrfs_rbio_ops operation;
 
/* first bad stripe */
int faila;
@@ -147,7 +152,6 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
-
atomic_t stripes_pending;
 
atomic_t error;
@@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
return 0;
 
/* reads can't merge with writes */
-   if (last->read_rebuild !=
-   cur->read_rebuild) {
+   if (last->operation != cur->operation) {
return 0;
}
 
@@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
spin_unlock(&rbio->bio_list_lock);
spin_unlock_irqrestore(&h->lock, flags);
 
-   if (next->read_rebuild)
+   if (next->operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else {
+   else if (next->operation == BTRFS_RBIO_WRITE){
steal_rbio(rbio, next);
async_rmw_stripe(next);
}
@@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
}
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
+   rbio->operation = BTRFS_RBIO_WRITE;
 
/*
 * don't plug on full rbios, just get them out the door
@@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
faila = rbio->faila;
failb = rbio->failb;
 
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
spin_lock_irq(&rbio->bio_list_lock);
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
@@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1871,7 +1875,7 @@ pstripe:
 * know they can be trusted.  If this was a read reconstruction,
 * other endio functions will fiddle the uptodate bits
 */
-   if (!rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_WRITE) {
for (i = 0;  i < nr_pages; i++) {
if (faila != -1) {
page = rbio_stripe_page(rbio, faila, i);
@@ -1888,7 +1892,7 @@ pstripe:
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1904,7 +1908,7 @@ cleanup:
 
 cleanup_io:
 
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
if (err == 0)
cache_rbio_pages(rbio);
else
@@ -2042,7 +2046,7 @@ out:
return 0;
 
 cleanup:
-   if (rbio->read_rebuild)
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
rbio_orig_end_io(rbio, -EIO

[PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition

2014-11-14 Thread Miao Xie
From: Zhao Lei 

bbio_ret in this condition is always !NULL because the previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
BTRFS_BLOCK_GROUP_RAID6)) {
u64 tmp;
 
-   if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1)
-   && raid_map_ret) {
+   if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
int i, rot;
 
/* push stripe_nr back to the start of the full stripe */
-- 
1.9.3



[PATCH 9/9] Btrfs, replace: enable dev-replace for raid56

2014-11-14 Thread Miao Xie
From: Zhao Lei 

Signed-off-by: Zhao Lei 
Signed-off-by: Miao Xie 
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6f662b3..6aa835c 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-   if (btrfs_fs_incompat(fs_info, RAID56)) {
-   btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-   return -EOPNOTSUPP;
-   }
-
switch (args->start.cont_reading_from_srcdev_mode) {
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3



[PATCH 6/9] Btrfs,raid56: support parity scrub on raid56

2014-11-14 Thread Miao Xie
The implementation is:
- Read and check all the data with checksums in the same stripe.
  All the data which has a checksum is COW data, and we are sure that it
  is not changed even though we don't lock the stripe: the space of that
  data can only be reclaimed after the current transaction is committed,
  and only then can the fs reuse it to store other data; but while doing
  scrub we hold the current transaction, so that data cannot be reclaimed.
  It is therefore safe to read and check it outside the stripe lock.
- Lock the stripe
- Read out all the data without checksums, and the parity.
  The data without checksums and the parity may be changed if we don't
  lock the stripe, so we need to read them in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write it back if the old parity
  is not right (a sketch of this step follows below)
- Unlock the stripe

If we cannot read out the data, or the data we read is corrupted, we will
try to repair it. If the repair fails, we will mark the horizontal
sub-stripe (the pages on the same horizontal) as a corrupted sub-stripe,
and we will skip the parity check and repair of that horizontal sub-stripe.

And in order to skip the horizontal sub-stripes that have no data, we
introduce a bitmap. If there is some data on a horizontal sub-stripe, we
set the corresponding bit to 1, and when we check and repair the parity,
we skip those horizontal sub-stripes whose corresponding bits are 0.
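
For the parity check and re-calculation steps above, here is a minimal
user-space sketch of the idea for RAID5 (XOR parity only; the real code
works on stripe pages and also handles the RAID6 Q stripe):

#include <string.h>
#include <assert.h>

#define STRIPE_SIZE 4096

/* recompute the RAID5 parity from the data stripes, compare it with the
 * parity read from disk, and "write back" the new parity if it differs */
static int check_and_repair_parity(unsigned char data[][STRIPE_SIZE],
				   int nr_data, unsigned char *parity)
{
	unsigned char calc[STRIPE_SIZE];
	int i, j;

	memset(calc, 0, sizeof(calc));
	for (i = 0; i < nr_data; i++)
		for (j = 0; j < STRIPE_SIZE; j++)
			calc[j] ^= data[i][j];

	if (memcmp(calc, parity, STRIPE_SIZE) == 0)
		return 0;                      /* the old parity is right */

	memcpy(parity, calc, STRIPE_SIZE);     /* repair the parity */
	return 1;
}

int main(void)
{
	unsigned char data[2][STRIPE_SIZE] = { { 0xaa }, { 0x55 } };
	unsigned char parity[STRIPE_SIZE] = { 0 };

	assert(check_and_repair_parity(data, 2, parity) == 1); /* repaired */
	assert(check_and_repair_parity(data, 2, parity) == 0); /* now clean */
	return 0;
}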

Signed-off-by: Miao Xie 
---
 fs/btrfs/raid56.c | 500 -
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 599 +-
 3 files changed, 1099 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d550e9b..a13eb1b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -65,6 +65,7 @@
 enum btrfs_rbio_ops {
BTRFS_RBIO_WRITE= 0,
BTRFS_RBIO_READ_REBUILD = 1,
+   BTRFS_RBIO_PARITY_SCRUB = 2,
 };
 
 struct btrfs_raid_bio {
@@ -123,6 +124,7 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int stripe_npages;
/*
 * set if we're doing a parity rebuild
 * for a read from higher up, which is handled
@@ -137,6 +139,7 @@ struct btrfs_raid_bio {
/* second bad stripe (for raid6 use) */
int failb;
 
+   int scrubp;
/*
 * number of pages needed to represent the full
 * stripe
@@ -171,6 +174,11 @@ struct btrfs_raid_bio {
 * here for faster lookup
 */
struct page **bio_pages;
+
+   /*
+* bitmap to record which horizontal stripe has data
+*/
+   unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -185,6 +193,8 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+int need_check);
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -950,9 +960,11 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
struct btrfs_raid_bio *rbio;
int nr_data = 0;
int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+   int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
-   rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
+   rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2 +
+  DIV_ROUND_UP(stripe_npages, BITS_PER_LONG / 8),
GFP_NOFS);
if (!rbio)
return ERR_PTR(-ENOMEM);
@@ -967,6 +979,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
rbio->fs_info = root->fs_info;
rbio->stripe_len = stripe_len;
rbio->nr_pages = num_pages;
+   rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
atomic_set(&rbio->refs, 1);
@@ -980,6 +993,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
p = rbio + 1;
rbio->stripe_pages = p;
rbio->bio_pages = p + sizeof(struct page *) * num_pages;
+   rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
nr_data = bbio->num_stripes - 2;
@@ -1774,6 +1788,14 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
index_rbio_pages(rbio);
 
for (pagenr = 0; pagenr < nr_pages; pagenr++) {
+   /*
+* Now we just use bitmap to mark the horizontal stripes in
+* which we have data when doing parity scrub.
+*

[PATCH 0/9] Implement device scrub/replace for RAID56

2014-11-14 Thread Miao Xie
This patchset implements the device scrub/replace function for RAID56. Most
of the handling of the common data is similar to the other RAID types; the
difference, and the difficulty, is the parity processing. In order to avoid
the problem that the data which is easy to change is handled outside the
stripe lock, we do most of the work in the RAID56 stripe lock context.

And in order to avoid making the code more and more complex, we copy some
code of the common data processing for the parity; the cleanup work is on
my TODO list.

We have done some tests and the patchset worked well. Of course, more tests
are welcome. If you are interested in using or testing it, you can pull
the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

Miao Xie (6):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs,raid56: use a variant to record the operation type
  Btrfs,raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |   5 -
 fs/btrfs/raid56.c  | 711 +++-
 fs/btrfs/raid56.h  |  14 +-
 fs/btrfs/scrub.c   | 793 +++--
 fs/btrfs/volumes.c |  47 ++-
 fs/btrfs/volumes.h |  14 +-
 6 files changed, 1471 insertions(+), 113 deletions(-)

-- 
1.9.3



Re: [PATCH] Btrfs: fix incorrect compression ratio detection

2014-11-09 Thread Miao Xie
On Tue, 7 Oct 2014 18:44:35 -0400, Wang Shilong wrote:
> Steps to reproduce:
>  # mkfs.btrfs -f /dev/sdb
>  # mount -t btrfs /dev/sdb /mnt -o compress=lzo
>  # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
> 
> after the previous steps, the inode will be detected as having a bad
> compression ratio, and the NOCOMPRESS flag will be set for that inode.
> 
> The reason is that compression has a max page limit each time (128K); if a
> 132K write comes in, it will be split into two writes (128K + 4K). This bug
> is a leftover from commit 68bb462d42a (Btrfs: don't compress for a small write).
> 
> Fix this problem by checking before every compression pass: if it is a
> small write (<= blocksize), we bail out and fall into nocompression directly.
> 
> Signed-off-by: Wang Shilong 

Looks good.

Reviewed-by: Miao Xie 
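
For reference, the arithmetic behind the reproducer (a standalone
illustration, not btrfs code):

#include <stdio.h>

#define PAGE_SZ   4096
#define MAX_COMPR (128 * 1024)  /* btrfs compresses at most 128K per pass */

int main(void)
{
	int write_len = 33 * PAGE_SZ;          /* 135168 bytes, as in the dd */
	int first     = MAX_COMPR;             /* 131072: compressed normally */
	int second    = write_len - MAX_COMPR; /* 4096: <= blocksize, so it
						  used to hit the small-write
						  bailout and set NOCOMPRESS */
	printf("%d = %d + %d\n", write_len, first, second);
	return 0;
}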

> ---
>  fs/btrfs/inode.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 344a322..b78e90a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -411,14 +411,6 @@ static noinline int compress_file_range(struct inode *inode,
>   (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
>   btrfs_add_inode_defrag(NULL, inode);
>  
> - /*
> -  * skip compression for a small file range(<=blocksize) that
> -  * isn't an inline extent, since it dosen't save disk space at all.
> -  */
> - if ((end - start + 1) <= blocksize &&
> - (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
> - goto cleanup_and_bail_uncompressed;
> -
>   actual_end = min_t(u64, isize, end + 1);
>  again:
>   will_compress = 0;
> @@ -440,6 +432,14 @@ again:
>  
>   total_compressed = actual_end - start;
>  
> + /*
> +  * skip compression for a small file range(<=blocksize) that
> +  * isn't an inline extent, since it dosen't save disk space at all.
> +  */
> + if (total_compressed <= blocksize &&
> +(start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
> + goto cleanup_and_bail_uncompressed;
> +
>   /* we want to make sure that amount of ram required to uncompress
>* an extent is reasonable, so we limit the total size in ram
>* of a compressed extent to 128k.  This is a crucial number
> 



Re: [PATCH] Btrfs: don't do async reclaim during log replay V2

2014-11-06 Thread Miao Xie
On Thu, 6 Nov 2014 09:39:19 -0500, Josef Bacik wrote:
> On 10/23/2014 04:44 AM, Miao Xie wrote:
>> On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:
>>> Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
>>> during log replay.  This is because we use fs_info->fs_root as our root for
>>> shrinking and such.  Technically we can use whatever root we want, but let's
>>> just not allow async reclaim while we're doing log replay.  Thanks,
>>
>> Why not move the code of fs_root initialization to the front of log replay?
>> I think it is better than the fix way in this patch because the async 
>> reclaimer
>> can help us do some work.
>>
> 
> Because this is simpler.  We could move the initialization forward, but then 
> say somebody comes and adds some other dependency to the async reclaim stuff 
> in the future and doesn't think about log replay and suddenly some poor sap's 
> box panics on mount.  Log replay is a known quantity, we don't have to worry 
> about enospc, so lets make it as simple as possible.  Thanks,

Yes, you are right.

So this patch looks good.

Reviewed-by: Miao Xie 



Re: [PATCH] Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2

2014-11-04 Thread Miao Xie
On Mon, 3 Nov 2014 08:56:50 -0500, Josef Bacik wrote:
> Our gluster boxes get several thousand statfs() calls per second, which begins
> to suck hardcore with all of the lock contention on the chunk mutex and dev 
> list
> mutex.  We don't really need to hold these things, if we have transient
> weirdness with statfs() because of the chunk allocator we don't care, so 
> remove
> this locking.
> 
> We still need the dev_list lock if you mount with -o alloc_start however, 
> which
> is a good argument for nuking that thing from orbit, but that's a patch for
> another day.  Thanks,
> 
> Signed-off-by: Josef Bacik 
> ---
> V1->V2: make sure ->alloc_start is set before doing the dev extent lookup 
> logic.

I don't understand why we need the dev_list lock if we mount with -o alloc_start.
AFAIK, ->alloc_start is protected by chunk_mutex.

But I think we needn't care that someone changes ->alloc_start; in other words,
we needn't take chunk_mutex during the whole process. The following case can be
tolerated by the users, I think.

Task1                                           Task2
statfs
  mutex_lock(&fs_info->chunk_mutex);
  tmp = fs_info->alloc_start;
  mutex_unlock(&fs_info->chunk_mutex);
  btrfs_calc_avail_data_space(fs_info, tmp)
    ...                                         mount -o remount,alloc_start=
    ...

Thanks
Miao

> 
>  fs/btrfs/super.c | 72 
> 
>  1 file changed, 47 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 54bd91e..dc337d1 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1644,8 +1644,20 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
>   int i = 0, nr_devices;
>   int ret;
>  
> + /*
> +  * We aren't under the device list lock, so this is racey-ish, but good
> +  * enough for our purposes.
> +  */
>   nr_devices = fs_info->fs_devices->open_devices;
> - BUG_ON(!nr_devices);
> + if (!nr_devices) {
> + smp_mb();
> + nr_devices = fs_info->fs_devices->open_devices;
> + ASSERT(nr_devices);
> + if (!nr_devices) {
> + *free_bytes = 0;
> + return 0;
> + }
> + }
>  
>   devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
>  GFP_NOFS);
> @@ -1670,11 +1682,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
>   else
>   min_stripe_size = BTRFS_STRIPE_LEN;
>  
> - list_for_each_entry(device, &fs_devices->devices, dev_list) {
> + if (fs_info->alloc_start)
> + mutex_lock(&fs_devices->device_list_mutex);
> + rcu_read_lock();
> + list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
>   if (!device->in_fs_metadata || !device->bdev ||
>   device->is_tgtdev_for_dev_replace)
>   continue;
>  
> + if (i >= nr_devices)
> + break;
> +
>   avail_space = device->total_bytes - device->bytes_used;
>  
>   /* align with stripe_len */
> @@ -1689,24 +1707,32 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
>   skip_space = 1024 * 1024;
>  
>   /* user can set the offset in fs_info->alloc_start. */
> - if (fs_info->alloc_start + BTRFS_STRIPE_LEN <=
> - device->total_bytes)
> + if (fs_info->alloc_start &&
> + fs_info->alloc_start + BTRFS_STRIPE_LEN <=
> + device->total_bytes) {
> + rcu_read_unlock();
>   skip_space = max(fs_info->alloc_start, skip_space);
>  
> - /*
> -  * btrfs can not use the free space in [0, skip_space - 1],
> -  * we must subtract it from the total. In order to implement
> -  * it, we account the used space in this range first.
> -  */
> - ret = btrfs_account_dev_extents_size(device, 0, skip_space - 1,
> -  &used_space);
> - if (ret) {
> - kfree(devices_info);
> - return ret;
> - }
> + /*
> +  * btrfs can not use the free space in
> +  * [0, skip_space - 1], we must subtract it from the
> +  * total. In order to implement it, we account the used
> +  * space in this range first.
> +  */
> + ret = btrfs_account_dev_extents_size(device, 0,
> +  skip_space - 1,
> +  &used_space);
> +  

Re: [PATCH v3] Btrfs: fix snapshot inconsistency after a file write followed by truncate

2014-10-29 Thread Miao Xie
On Wed, 29 Oct 2014 08:21:12 +, Filipe Manana wrote:
> If right after starting the snapshot creation ioctl we perform a write 
> against a
> file followed by a truncate, with both operations increasing the file's size, 
> we
> can get a snapshot tree that reflects a state of the source subvolume's tree 
> where
> the file truncation happened but the write operation didn't. This leaves a gap
> between 2 file extent items of the inode, which makes btrfs' fsck complain 
> about it.
> 
> For example, if we perform the following file operations:
> 
> $ mkfs.btrfs -f /dev/vdd
> $ mount /dev/vdd /mnt
> $ xfs_io -f \
>   -c "pwrite -S 0xaa -b 32K 0 32K" \
>   -c "fsync" \
>   -c "pwrite -S 0xbb -b 32770 16K 32770" \
>   -c "truncate 90123" \
>   /mnt/foobar
> 
> and the snapshot creation ioctl was just called before the second write, we 
> often
> can get the following inode items in the snapshot's btree:
> 
> item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
> inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
> item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
> inode ref index 282 namelen 10 name: foobar
> item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
> extent data disk byte 1104855040 nr 32768
> extent data offset 0 nr 32768 ram 32768
> extent compression 0
> item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
> extent data disk byte 0 nr 0
> extent data offset 0 nr 40960 ram 40960
> extent compression 0
> 
> There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 
> 4096)[
> for which there's no file extent item covering it. This is because the file 
> write
> and file truncate operations happened both right after the snapshot creation 
> ioctl
> called btrfs_start_delalloc_inodes(), which means we didn't start and wait 
> for the
> ordered extent that matches the write and, in btrfs_setsize(), we were able 
> to call
> btrfs_cont_expand() before being able to commit the current transaction in the
> snapshot creation ioctl. So this made it possibe to insert the hole file 
> extent
> item in the source subvolume (which represents the region added by the 
> truncate)
> right before the transaction commit from the snapshot creation ioctl.
> 
> Btrfs' fsck tool complains about such cases with a message like the following:
> 
> "root 331 inode 257 errors 100, file extent discount"
> 
> From a user perspective, the expectation when a snapshot is created while those
> file operations are being performed is that the snapshot will have a file that
> either:
> 
> 1) is empty
> 2) only the first write was captured
> 3) only the 2 writes were captured
> 4) both writes and the truncation were captured
> 
> But never capture a state where only the first write and the truncation were
> captured (since the second write was performed before the truncation).
> 
> A test case for xfstests follows.
> 
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Use different approach to solve the problem. Don't start and wait for all
> dellaloc to finish after every expanding truncate, instead add an 
> additional
> flush at transaction commit time if we're doing a transaction commit that
> creates snapshots.

This method will make the transaction commit take more time. Why not use
i_disk_size to expand the file size in btrfs_setsize()? Or we might rename
btrfs_{start, end}_nocow_write() and use them in btrfs_setsize()?

Thanks
Miao

> 
> V3: Removed useless test condition in +wait_pending_snapshot_roots_delalloc().
> 
>  fs/btrfs/transaction.c | 59 ++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 396ae8b..5e7f004 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1714,12 +1714,65 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
>   btrfs_wait_ordered_roots(fs_info, -1);
>  }
>  
> +static int
> +start_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
> +   struct list_head *splice)
> +{
> + struct btrfs_pending_snapshot *pending_snapshot;
> + int ret = 0;
> +
> + if (btrfs_test_opt(trans->root, FLUSHONCOMMIT))
> + return 0;
> +
> + spin_lock(&trans->root->fs_info->trans_lock);
> + list_splice_init(&trans->transaction->pending_snapshots, splice);
> + spin_unlock(&trans->root->fs_info->trans_lock);
> +
> + /*
> +  * Start again delalloc for the roots our pending snapshots are made
> +  * from. We did it before starting/joining a transaction and we do it
> +  * here again because new inode operations might have happened

Re: [PATCH] Btrfs: don't do async reclaim during log replay V2

2014-10-29 Thread Miao Xie
Ping..

On Thu, 23 Oct 2014 16:44:54 +0800, Miao Xie wrote:
> On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:
>> Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
>> during log replay.  This is because we use fs_info->fs_root as our root for
>> shrinking and such.  Technically we can use whatever root we want, but let's
>> just not allow async reclaim while we're doing log replay.  Thanks,
> 
> Why not move the code of fs_root initialization to the front of log replay?
> I think it is better than the fix way in this patch because the async 
> reclaimer
> can help us do some work.
> 
> Thanks
> Miao
> 
>>
>> Signed-off-by: Josef Bacik 
>> ---
>> V1->V2: use fs_info->log_root_recovering instead, didn't notice this existed
>> before.
>>
>>  fs/btrfs/extent-tree.c | 8 +++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 28a27d5..44d0497 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4513,7 +4513,13 @@ again:
>>  space_info->flush = 1;
>>  } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>>  used += orig_bytes;
>> -if (need_do_async_reclaim(space_info, root->fs_info, used) &&
>> +/*
>> + * We will do the space reservation dance during log replay,
>> + * which means we won't have fs_info->fs_root set, so don't do
>> + * the async reclaim as we will panic.
>> + */
>> +if (!root->fs_info->log_root_recovering &&
>> +need_do_async_reclaim(space_info, root->fs_info, used) &&
>>  !work_busy(&root->fs_info->async_reclaim_work))
>>  queue_work(system_unbound_wq,
>> &root->fs_info->async_reclaim_work);
>>
> 
> 



Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items

2014-10-27 Thread Miao Xie
On Mon, 27 Oct 2014 13:44:22 +, Filipe David Manana wrote:
> On Mon, Oct 27, 2014 at 12:11 PM, Filipe David Manana
>  wrote:
>> On Mon, Oct 27, 2014 at 11:08 AM, Miao Xie  wrote:
>>> On Mon, 27 Oct 2014 09:19:52 +, Filipe Manana wrote:
>>>> We have a race that can lead us to miss skinny extent items in the function
>>>> btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
>>>> So basically the sequence of steps is:
>>>>
>>>> 1) We search in the extent tree for the skinny extent, which returns > 0
>>>>(not found);
>>>>
>>>> 2) We check the previous item in the returned leaf for a non-skinny extent,
>>>>and we don't find it;
>>>>
>>>> 3) Because we didn't find the non-skinny extent in step 2), we release our
>>>>path to search the extent tree again, but this time for a non-skinny
>>>>extent key;
>>>>
>>>> 4) Right after we released our path in step 3), a skinny extent was 
>>>> inserted
>>>>in the extent tree (delayed refs were run) - our second extent tree 
>>>> search
>>>>will miss it, because it's not looking for a skinny extent;
>>>>
>>>> 5) After the second search returned (with ret > 0), we look for any delayed
>>>>ref for our extent's bytenr (and we do it while holding a read lock on 
>>>> the
>>>>leaf), but we won't find any, as such delayed ref had just run and 
>>>> completed
>>>>after we released out path in step 3) before doing the second search.
>>>>
>>>> Fix this by removing completely the path release and re-search logic. This 
>>>> is
>>>> safe, because if we seach for a metadata item and we don't find it, we 
>>>> have the
>>>> guarantee that the returned leaf is the one where the item would be 
>>>> inserted,
>>>> and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
>>>> non-skinny extent item is if it exists. The only case where path->slots[0] 
>>>> is
>>>
>>> I think this analysis is wrong if there are some independent shared ref
>>> metadata items for a tree block, just like:
>>> +------------------------+-------------+-------------+
>>> | tree block extent item | shared ref1 | shared ref2 |
>>> +------------------------+-------------+-------------+
> 
> Trying to guess what's in your mind.
> 
> Is the concern that if after a non-skinny extent item we have
> non-inlined references, the assumption that path->slots[0] - 1 points
> to the extent item would be wrong when searching for a skinny extent?
> 
> That wouldn't be the case because BTRFS_EXTENT_ITEM_KEY == 168 and
> BTRFS_METADATA_ITEM_KEY == 169, with BTRFS_SHARED_BLOCK_REF_KEY ==
> 182. So in the presence of such non-inlined shared tree block
> reference items, searching for a skinny extent item leaves us at a
> slot that points to the first non-inlined ref (regardless of its type,
> since they're all > 169), and therefore path->slots[0] - 1 is the
> non-skinny extent item.

You are right. I forgot to check the value of the key type. Sorry.

This patch looks good to me.

Reviewed-by: Miao Xie 
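
For reference, a standalone illustration of that ordering (a simplified
comparator that ignores the key offset, which doesn't matter for this point):

#include <assert.h>

#define BTRFS_EXTENT_ITEM_KEY      168
#define BTRFS_METADATA_ITEM_KEY    169
#define BTRFS_SHARED_BLOCK_REF_KEY 182

struct key { unsigned long long objectid; int type; };

/* keys with the same objectid sort by type */
static int key_cmp(struct key a, struct key b)
{
	if (a.objectid != b.objectid)
		return a.objectid < b.objectid ? -1 : 1;
	return a.type - b.type;
}

int main(void)
{
	struct key extent = { 12345, BTRFS_EXTENT_ITEM_KEY };
	struct key skinny = { 12345, BTRFS_METADATA_ITEM_KEY };
	struct key shared = { 12345, BTRFS_SHARED_BLOCK_REF_KEY };

	/* a search for the skinny key lands after the non-skinny extent
	 * item and before any non-inlined ref, so slot - 1 is always the
	 * extent item even when shared refs follow it */
	assert(key_cmp(extent, skinny) < 0);
	assert(key_cmp(skinny, shared) < 0);
	return 0;
}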

> 
> thanks.
> 
>>
>> Why does that matters? Can you elaborate why it's not correct?
>>
>> We're looking for the extent item only in btrfs_lookup_extent_info(),
>> and running a delayed ref, independently of being inlined/shared, it
>> implies inserting a new extent item or updating an existing extent
>> item (updating ref count).
>>
>> thanks
>>
>>>
>>> Thanks
>>> Miao
>>>
>>>> zero is when there are no smaller keys in the tree (i.e. no left siblings 
>>>> for
>>>> our leaf), in which case the re-search logic isn't needed as well.
>>>>
>>>> This race has been present since the introduction of skinny metadata 
>>>> (change
>>>> 3173a18f70554fe7880bb2d85c7da566e364eb3c).
>>>>
>>>> Signed-off-by: Filipe Manana 
>>>> ---
>>>>  fs/btrfs/extent-tree.c | 8 
>>>>  1 file changed, 8 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index 9141b2b..2cedd06 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -780,7 +780,6 @@ search_again:

Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items

2014-10-27 Thread Miao Xie
On Mon, 27 Oct 2014 09:19:52 +, Filipe Manana wrote:
> We have a race that can lead us to miss skinny extent items in the function
> btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
> So basically the sequence of steps is:
> 
> 1) We search in the extent tree for the skinny extent, which returns > 0
>(not found);
> 
> 2) We check the previous item in the returned leaf for a non-skinny extent,
>and we don't find it;
> 
> 3) Because we didn't find the non-skinny extent in step 2), we release our
>path to search the extent tree again, but this time for a non-skinny
>extent key;
> 
> 4) Right after we released our path in step 3), a skinny extent was inserted
>in the extent tree (delayed refs were run) - our second extent tree search
>will miss it, because it's not looking for a skinny extent;
> 
> 5) After the second search returned (with ret > 0), we look for any delayed
>ref for our extent's bytenr (and we do it while holding a read lock on the
>leaf), but we won't find any, as such delayed ref had just run and 
> completed
>after we released out path in step 3) before doing the second search.
> 
> Fix this by removing completely the path release and re-search logic. This is
> safe, because if we seach for a metadata item and we don't find it, we have 
> the
> guarantee that the returned leaf is the one where the item would be inserted,
> and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
> non-skinny extent item is if it exists. The only case where path->slots[0] is

I think this analysis is wrong if there are some independent shared ref 
metadata for
a tree block, just like:
++-+-+
| tree block extent item | shared ref1 | shared ref2 |
++-+-+

Thanks
Miao

> zero is when there are no smaller keys in the tree (i.e. no left siblings for
> our leaf), in which case the re-search logic isn't needed as well.
> 
> This race has been present since the introduction of skinny metadata (change
> 3173a18f70554fe7880bb2d85c7da566e364eb3c).
> 
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/extent-tree.c | 8 
>  1 file changed, 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 9141b2b..2cedd06 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -780,7 +780,6 @@ search_again:
>   else
>   key.type = BTRFS_EXTENT_ITEM_KEY;
>  
> -again:
>   ret = btrfs_search_slot(trans, root->fs_info->extent_root,
>   &key, path, 0, 0);
>   if (ret < 0)
> @@ -796,13 +795,6 @@ again:
>   key.offset == root->nodesize)
>   ret = 0;
>   }
> - if (ret) {
> - key.objectid = bytenr;
> - key.type = BTRFS_EXTENT_ITEM_KEY;
> - key.offset = root->nodesize;
> - btrfs_release_path(path);
> - goto again;
> - }
>   }
>  
>   if (ret == 0) {
> 



Re: [PATCH] Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()

2014-10-27 Thread Miao Xie
On Mon, 27 Oct 2014 09:16:55 +, Filipe Manana wrote:
> If we couldn't find our extent item, we accessed the current slot
> (path->slots[0]) to check if it corresponds to an equivalent skinny
> metadata item. However this slot could be beyond our last item in the
> leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
> we shouldn't process it.
> 
> Since btrfs_lookup_extent() is only used to find extent items for data
> extents, fix this by removing completely the logic that looks up for an
> equivalent skinny metadata item, since it can not exist.

I think we also need a better function name, such as btrfs_lookup_data_extent.

Thanks
Miao

> 
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/extent-tree.c | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0d599ba..9141b2b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -710,7 +710,7 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info 
> *info)
>   rcu_read_unlock();
>  }
>  
> -/* simple helper to search for an existing extent at a given offset */
> +/* simple helper to search for an existing data extent at a given offset */
>  int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
>  {
>   int ret;
> @@ -726,12 +726,6 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 
> start, u64 len)
>   key.type = BTRFS_EXTENT_ITEM_KEY;
>   ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
>   0, 0);
> - if (ret > 0) {
> - btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> - if (key.objectid == start &&
> - key.type == BTRFS_METADATA_ITEM_KEY)
> - ret = 0;
> - }
>   btrfs_free_path(path);
>   return ret;
>  }
> 



Re: device balance times

2014-10-23 Thread Miao Xie
On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote:
> On 22.10.2014 03:43, Chris Murphy wrote:
>> On Oct 21, 2014, at 4:14 PM, Piotr Pawłow  wrote:
>>> Looks normal to me. Last time I started a balance after adding 6th device 
>>> to my FS, it took 4 days to move 25GBs of data.
>> It's long term untenable. At some point it must be fixed. It's way, way 
>> slower than md raid.
>> At a certain point it needs to fallback to block level copying, with a ~ 
>> 32KB block. It can't be treating things as if they're 1K files, doing file 
>> level copying that takes forever. It's just too risky that another device 
>> fails in the meantime.
> 
> There's "device replace" for restoring redundancy, which is fast, but not 
> implemented yet for RAID5/6.

My colleague and I are now implementing scrub/replace for RAID5/6,
and I have a plan to reimplement balance and split it off from the
metadata/file data processing. The main idea is:
- allocate a new chunk which has the same size as the relocated one, but
  don't insert it into the block group list, so we don't allocate free
  space from it
- set the source chunk to be read-only
- copy the data from the source chunk to the new chunk
- replace the extent map of the source chunk with the one of the new chunk
  (the new chunk has the same logical address and length as the old one)
- release the source chunk

This way, we needn't process the data extent by extent, and needn't do any
space reservation, so it will be very fast even when we have lots of
snapshots. A toy model of the idea is sketched below.
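
A toy user-space model of that flow (everything here is hypothetical and
only illustrates that the new chunk inherits the logical address, so nothing
that refers to that address has to change):

#include <stdlib.h>
#include <string.h>
#include <assert.h>

struct chunk {
	unsigned long long logical;  /* the address the rest of the fs sees */
	size_t len;
	unsigned char *data;
	int readonly;
};

static struct chunk *alloc_chunk(unsigned long long logical, size_t len)
{
	struct chunk *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->logical = logical;
	c->len = len;
	c->data = calloc(1, len);
	if (!c->data) {
		free(c);
		return NULL;
	}
	return c;
}

static struct chunk *relocate_chunk(struct chunk *src)
{
	/* same size and same logical address as the relocated chunk */
	struct chunk *dst = alloc_chunk(src->logical, src->len);

	if (!dst)
		return NULL;
	src->readonly = 1;                      /* block new writes */
	memcpy(dst->data, src->data, src->len); /* bulk copy, no extent walk */
	free(src->data);                        /* release the source chunk */
	free(src);
	return dst;
}

int main(void)
{
	struct chunk *c = alloc_chunk(1ULL << 30, 4096);

	assert(c);
	c->data[0] = 42;
	c = relocate_chunk(c);
	assert(c && c->logical == (1ULL << 30) && c->data[0] == 42);
	free(c->data);
	free(c);
	return 0;
}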

Thanks
Miao

> 
> I think the problem is that balance was originally used for balancing data / 
> metadata split - moving stuff out of mostly empty chunks to free them and use 
> for something else. It pretty much has to be done on the extent level.
> 
> Then balance was repurposed for things like converting RAID profiles and 
> restoring redundancy and balancing device usage in multi-device 
> configurations. It works, but the approach to do it extent by extent is slow.
> 
> I wonder if we could do some of these operations by just copying whole chunks 
> in bulk. Wasn't that the point of introducing logical addresses? - to be able 
> to move chunks around quickly without changing anything except updating chunk 
> pointers?
> 
> BTW: I'd love a simple interface to be able to select a chunk and tell it to 
> move somewhere else. I'd like to tell chunks with metadata, or with tons of 
> extents: Hey, chunks! Why don't you move to my SSDs? :)
> 



Re: [PATCH] Btrfs: don't do async reclaim during log replay V2

2014-10-23 Thread Miao Xie
On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:
> Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
> during log replay.  This is because we use fs_info->fs_root as our root for
> shrinking and such.  Technically we can use whatever root we want, but let's
> just not allow async reclaim while we're doing log replay.  Thanks,

Why not move the code of fs_root initialization to the front of log replay?
I think it is better than the fix way in this patch because the async reclaimer
can help us do some work.

Thanks
Miao

> 
> Signed-off-by: Josef Bacik 
> ---
> V1->V2: use fs_info->log_root_recovering instead, didn't notice this existed
> before.
> 
>  fs/btrfs/extent-tree.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 28a27d5..44d0497 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4513,7 +4513,13 @@ again:
>   space_info->flush = 1;
>   } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>   used += orig_bytes;
> - if (need_do_async_reclaim(space_info, root->fs_info, used) &&
> + /*
> +  * We will do the space reservation dance during log replay,
> +  * which means we won't have fs_info->fs_root set, so don't do
> +  * the async reclaim as we will panic.
> +  */
> + if (!root->fs_info->log_root_recovering &&
> + need_do_async_reclaim(space_info, root->fs_info, used) &&
>   !work_busy(&root->fs_info->async_reclaim_work))
>   queue_work(system_unbound_wq,
>  &root->fs_info->async_reclaim_work);
> 



Re: [PATCH] Btrfs: properly clean up btrfs_end_io_wq_cache

2014-10-23 Thread Miao Xie
On Wed, 15 Oct 2014 17:19:59 -0400, Josef Bacik wrote:
> In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
> unload, which makes us unable to unload and then re-load the btrfs module.  
> This
> fixes the problem.  Thanks,
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Miao Xie 

> ---
>  fs/btrfs/super.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b83ef15..c1d020f 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2151,6 +2151,7 @@ static void __exit exit_btrfs_fs(void)
>   extent_map_exit();
>   extent_io_exit();
>   btrfs_interface_exit();
> + btrfs_end_io_wq_exit();
>   unregister_filesystem(&btrfs_fs_type);
>   btrfs_exit_sysfs();
>   btrfs_cleanup_fs_uuids();
> 



Re: [PATCH 2/2] Btrfs: check-int: don't complain about balanced blocks

2014-10-17 Thread Miao Xie
On Thu, 16 Oct 2014 17:48:49 +0200, Stefan Behrens wrote:
> The xfstest btrfs/014 which tests the balance operation caused that the
> check_int module complained that known blocks changed their physical
> location. Since this is not an error in this case, only print such
> message if the verbose mode was enabled.
> 
> Reported-by: Wang Shilong 
> Signed-off-by: Stefan Behrens 
> ---
>  fs/btrfs/check-integrity.c | 87 ++
>  1 file changed, 49 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 65fc2e0bbc4a..65226d7c9fe0 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1325,24 +1325,28 @@ static int btrfsic_create_link_to_next_block(
>   l = NULL;
>   next_block->generation = BTRFSIC_GENERATION_UNKNOWN;
>   } else {
> - if (next_block->logical_bytenr != next_bytenr &&
> - !(!next_block->is_metadata &&
> -   0 == next_block->logical_bytenr)) {
> - printk(KERN_INFO
> -"Referenced block @%llu (%s/%llu/%d)"
> -" found in hash table, %c,"
> -" bytenr mismatch (!= stored %llu).\n",
> -next_bytenr, next_block_ctx->dev->name,
> -next_block_ctx->dev_bytenr, *mirror_nump,
> -btrfsic_get_block_type(state, next_block),
> -next_block->logical_bytenr);
> - } else if (state->print_mask & BTRFSIC_PRINT_MASK_VERBOSE)
> - printk(KERN_INFO
> -"Referenced block @%llu (%s/%llu/%d)"
> -" found in hash table, %c.\n",
> -next_bytenr, next_block_ctx->dev->name,
> -next_block_ctx->dev_bytenr, *mirror_nump,
> -btrfsic_get_block_type(state, next_block));
> + if (state->print_mask & BTRFSIC_PRINT_MASK_VERBOSE) {
> + if (next_block->logical_bytenr != next_bytenr &&
> + !(!next_block->is_metadata &&
> +   0 == next_block->logical_bytenr))
> + printk(KERN_INFO
> +"Referenced block @%llu (%s/%llu/%d)"
> +" found in hash table, %c,"
> +" bytenr mismatch (!= stored %llu).\n",

According to the coding style, user-visible strings should not be broken
across lines.

Thanks
Miao

> +next_bytenr, next_block_ctx->dev->name,
> +next_block_ctx->dev_bytenr, *mirror_nump,
> +btrfsic_get_block_type(state,
> +   next_block),
> +next_block->logical_bytenr);
> + else
> + printk(KERN_INFO
> +"Referenced block @%llu (%s/%llu/%d)"
> +" found in hash table, %c.\n",
> +next_bytenr, next_block_ctx->dev->name,
> +next_block_ctx->dev_bytenr, *mirror_nump,
> +btrfsic_get_block_type(state,
> +   next_block));
> + }
>   next_block->logical_bytenr = next_bytenr;
>  
>   next_block->mirror_num = *mirror_nump;
> @@ -1528,7 +1532,9 @@ static int btrfsic_handle_extent_data(
>   return -1;
>   }
>   if (!block_was_created) {
> - if (next_block->logical_bytenr != next_bytenr &&
> + if ((state->print_mask &
> +  BTRFSIC_PRINT_MASK_VERBOSE) &&
> + next_block->logical_bytenr != next_bytenr &&
>   !(!next_block->is_metadata &&
> 0 == next_block->logical_bytenr)) {
>   printk(KERN_INFO
> @@ -1881,25 +1887,30 @@ again:
>  dev_state,
>  dev_bytenr);
>   }
> - if (block->logical_bytenr != bytenr &&
> - !(!block->is_metadata &&
> -   block->logical_bytenr == 0))
> - printk(KERN_INFO
> -"Written block @%llu (%s/%llu/%d)"
> -" found in hash table, %c,"
> -   

Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-12 Thread Miao Xie
Guan

On Sat, 11 Oct 2014 14:45:29 +0800, Eryu Guan wrote:
 device replace could fail due to another running scrub process, but this
 failure doesn't get returned to userspace.

 The following steps could reproduce this issue

mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do
btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1
done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

# once you see the error log in dmesg, check return value of
# replace
echo $?

 Also only WARN_ON if the return code is not -EINPROGRESS.

 Signed-off-by: Eryu Guan 
>>>
>>> Ping, any comments on this patch?
>>>
>>> Thanks,
>>> Eryu
 ---
  fs/btrfs/dev-replace.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..44d32ab 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
  &dev_replace->scrub_progress, 0, 1);
  
ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 -  WARN_ON(ret);
 +  /* don't warn if EINPROGRESS, someone else might be running scrub */
 +  if (ret != -EINPROGRESS)
 +  WARN_ON(ret);
>>
>> picky comment
>>
>> I prefer WARN_ON(ret && ret != -EINPROGRESS).
> 
> Yes, this is simpler :)
>>
  
 -  return 0;
 +  return ret;
>>
>> Here we will return -EINPROGRESS if scrub is running. I think it is better
>> that we assign some special number to args->result and then return 0, just
>> like the case where a device replace is already running.
> 
> Seems that requires a new result type, say,
> 
> #define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS  3
> 
> and assign this result to args->result if btrfs_scrub_dev() returned 
> -EINPROGRESS
> 
> But I don't think returning 0 unconditionally is a good idea, since
> btrfs_dev_replace_finishing() could return other errors too, that way
> these errors will be lost, and userspace still won't catch the
> errors ($? is 0)

Of course.
Maybe my explanation above was not so clear. In fact, I was only talking about
the EINPROGRESS case; for the other cases, returning the error code is better.

> What I'm thinking about is something like:
> 
>   ret = btrfs_scrub_dev(...);
>   ret = btrfs_dev_replace_finishing(root->fs_info, ret);
>   if (ret == -EINPROGRESS) {
>   args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>   ret = 0;
>   } else {
>   WARN_ON(ret);
>   }
> 
>   return ret;
> 
> What do you think? If no objection I'll work on v2.

I like it.

Thanks
Miao

> Thanks for your review!
> 
> Eryu
>>
>> Thanks
>> Miao
>>
  
  leave:
dev_replace->srcdev = NULL;
@@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
  
 -  return 0;
 +  return scrub_ret;
}
  
printk_in_rcu(KERN_INFO
 -- 
 1.8.3.1

>>>
>>
> .
> 



Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-10 Thread Miao Xie
On Fri, 10 Oct 2014 15:13:31 +0800, Eryu Guan wrote:
> On Thu, Sep 25, 2014 at 06:28:14PM +0800, Eryu Guan wrote:
>> device replace could fail due to another running scrub process, but this
>> failure doesn't get returned to userspace.
>>
>> The following steps could reproduce this issue
>>
>>  mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
>>  mount /dev/sdb1 /mnt/btrfs
>>  while true; do
>>  btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1
>>  done &
>>  btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
>>  # if this replace succeeded, do the following and repeat until
>>  # you see this log in dmesg
>>  # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
>>  #btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
>>
>>  # once you see the error log in dmesg, check return value of
>>  # replace
>>  echo $?
>>
>> Also only WARN_ON if the return code is not -EINPROGRESS.
>>
>> Signed-off-by: Eryu Guan 
> 
> Ping, any comments on this patch?
> 
> Thanks,
> Eryu
>> ---
>>  fs/btrfs/dev-replace.c | 8 +---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index eea26e1..44d32ab 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
>>&dev_replace->scrub_progress, 0, 1);
>>  
>>  ret = btrfs_dev_replace_finishing(root->fs_info, ret);
>> -WARN_ON(ret);
>> +/* don't warn if EINPROGRESS, someone else might be running scrub */
>> +if (ret != -EINPROGRESS)
>> +WARN_ON(ret);

picky comment

I prefer WARN_ON(ret && ret != -EINPROGRESS).

>>  
>> -return 0;
>> +return ret;

Here we will return -EINPROGRESS if scrub is running. I think it is better
that we assign some special number to args->result and then return 0, just
like the case where a device replace is already running.

Thanks
Miao

>>  
>>  leave:
>>  dev_replace->srcdev = NULL;
>> @@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>  btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>>  mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>>  
>> -return 0;
>> +return scrub_ret;
>>  }
>>  
>>  printk_in_rcu(KERN_INFO
>> -- 
>> 1.8.3.1
>>
> 



Re: [PATCH 2/2 v5] Btrfs: be aware of btree inode write errors to avoid fs corruption

2014-09-24 Thread Miao Xie
On Wed, 24 Sep 2014 11:28:26 +0100, Filipe Manana wrote:
[SNIP]
>  int btrfs_wait_marked_extents(struct btrfs_root *root,
> +   struct btrfs_trans_handle *trans,
> struct extent_io_tree *dirty_pages, int mark)
>  {
>   int err = 0;
> @@ -852,6 +855,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>   struct extent_state *cached_state = NULL;
>   u64 start = 0;
>   u64 end;
> + int errors;
>  
>   while (!find_first_extent_bit(dirty_pages, start, &start, &end,
> EXTENT_NEED_WAIT, &cached_state)) {
> @@ -865,6 +869,16 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>   }
>   if (err)
>   werr = err;
> +
> + if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
> + errors = atomic_xchg(
> + &trans->transaction->log_eb_write_errors, 0);
> + else
> + errors = atomic_xchg(&trans->transaction->eb_write_errors, 0);

There is a bug in the log tree commit case.
As we know, each fs/file tree has two log sub-transactions; when we are
committing one log sub-transaction, the other one may be started by other
file sync tasks. It is very likely that no error happens in the former, but
some write errors happen in the latter. The above code might clear the count
of those errors.

So I think the variable for log write errors should be created per log
sub-transaction.
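
A user-space sketch of the accounting flaw (sequential code standing in for
the concurrent case; C11 atomics stand in for the kernel's atomic_t):

#include <stdatomic.h>
#include <assert.h>

static atomic_int shared_log_errors; /* one counter for both subs */
static atomic_int sub_errors[2];     /* one counter per sub-transaction */

int main(void)
{
	/* sub-transaction 1 hits a write error while sub-transaction 0
	 * is being committed */
	atomic_fetch_add(&shared_log_errors, 1);
	atomic_fetch_add(&sub_errors[1], 1);

	/* committing sub 0: the xchg on the shared counter also consumes
	 * the error that belongs to sub 1 ... */
	assert(atomic_exchange(&shared_log_errors, 0) == 1);

	/* ... while per-sub counters keep it where it belongs */
	assert(atomic_exchange(&sub_errors[0], 0) == 0);
	assert(atomic_exchange(&sub_errors[1], 0) == 1);
	return 0;
}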

Thanks
Miao

> +
> + if (errors && !werr)
> + werr = -EIO;
> +
>   return werr;
>  }
>  
> @@ -874,6 +888,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>   * those extents are on disk for transaction or log commit
>   */
>  static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
> + struct btrfs_trans_handle *trans,
>   struct extent_io_tree *dirty_pages, int mark)
>  {
>   int ret;
> @@ -883,7 +898,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>   blk_start_plug(&plug);
>   ret = btrfs_write_marked_extents(root, dirty_pages, mark);
>   blk_finish_plug(&plug);
> - ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
> + ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
>  
>   if (ret)
>   return ret;
> @@ -892,7 +907,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>   return 0;
>  }
>  
> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
> +static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>struct btrfs_root *root)
>  {
>   if (!trans || !trans->transaction) {
> @@ -900,7 +915,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>   btree_inode = root->fs_info->btree_inode;
>   return filemap_write_and_wait(btree_inode->i_mapping);
>   }
> - return btrfs_write_and_wait_marked_extents(root,
> + return btrfs_write_and_wait_marked_extents(root, trans,
>  &trans->transaction->dirty_pages,
>  EXTENT_DIRTY);
>  }
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 7dd558e..311f3e3 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -46,6 +46,8 @@ struct btrfs_transaction {
>*/
>   atomic_t num_writers;
>   atomic_t use_count;
> + atomic_t eb_write_errors;
> + atomic_t log_eb_write_errors;
>  
>   /* Be protected by fs_info->trans_lock when we want to change it. */
>   enum btrfs_trans_state state;
> @@ -146,8 +148,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
>   struct btrfs_root *root);
>  struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
>  int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
> -  struct btrfs_root *root);
>  
>  void btrfs_add_dead_root(struct btrfs_root *root);
>  int btrfs_defrag_root(struct btrfs_root *root);
> @@ -167,6 +167,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>  int btrfs_write_marked_extents(struct btrfs_root *root,
>   struct extent_io_tree *dirty_pages, int mark);
>  int btrfs_wait_marked_extents(struct btrfs_root *root,
> +   struct btrfs_trans_handle *trans,
>   struct extent_io_tree *dirty_pages, int mark);
>  int btrfs_transaction_blocked(struct btrfs_fs_info *info);
>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 2d0fa43..22ffd32 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs

Re: [PATCH] btrfs: Fix the wrong condition judgment about subset extent map

2014-09-21 Thread Miao Xie
This patch and the previous one (the following patch) also fixed an oops, which
can be reproduced by the LTP stress test (ltpstress.sh + fsstress).

[PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map

Thanks
Miao

On Mon, 22 Sep 2014 09:13:03 +0800, Qu Wenruo wrote:
> Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
> best fitted extent map
> is using a wrong condition to judge whether the range is a subset of an
> existing extent map.
> 
> This may cause a bug in btrfs no-holes mode.
> 
> This patch will correct the judgment and fix the bug.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8039021..a99ee9d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6527,7 +6527,7 @@ insert:
>* extent causing the -EEXIST.
>*/
>   if (start >= extent_map_end(existing) ||
> - start + len <= existing->start) {
> + start <= existing->start) {
>   /*
>* The existing extent map is the one nearest to
>* the [start, start + len) range which overlaps
> 
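
As a hedged aside (plain C, not the kernel's extent map code), the half-open
interval predicates being argued about look like this:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* [start, start + len) lies entirely inside [e_start, e_end) */
static bool range_is_subset(uint64_t start, uint64_t len,
                            uint64_t e_start, uint64_t e_end)
{
        return start >= e_start && start + len <= e_end;
}

/* [start, start + len) shares no byte with [e_start, e_end) */
static bool range_is_disjoint(uint64_t start, uint64_t len,
                              uint64_t e_start, uint64_t e_end)
{
        return start >= e_end || start + len <= e_start;
}

int main(void)
{
        printf("%d\n", range_is_subset(4096, 4096, 0, 16384));          /* 1 */
        printf("%d\n", range_is_disjoint(16384, 4096, 0, 16384));       /* 1 */
        return 0;
}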



Re: [PATCH] btrfs: fix ABBA deadlock in btrfs_dev_replace_finishing()

2014-09-21 Thread Miao Xie
It has been fixed by

https://patchwork.kernel.org/patch/4747961/

Thanks
Miao

On Sun, 21 Sep 2014 12:41:49 +0800, Eryu Guan wrote:
> btrfs_map_bio() first calls btrfs_bio_counter_inc_blocked(), which checks
> the fs state and increases bio_counter, then calls __btrfs_map_block(), which
> will take the dev_replace lock.
> 
> On the other hand, btrfs_dev_replace_finishing() takes dev_replace lock
> first then set fs state to BTRFS_FS_STATE_DEV_REPLACING and waits for
> bio_counter to be zero.
> 
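
To make the inverted order concrete, here is a hedged user-space model
(pthreads stand in for the kernel primitives; the real fix additionally sets
a blocked state so that new submitters wait):

#include <pthread.h>

static pthread_mutex_t replace_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t counter_zero = PTHREAD_COND_INITIALIZER;
static int bio_counter;

static void map_bio(void)                       /* IO submission path */
{
        pthread_mutex_lock(&counter_lock);
        bio_counter++;                          /* "A": announce in-flight IO */
        pthread_mutex_unlock(&counter_lock);

        pthread_mutex_lock(&replace_lock);      /* "B": __btrfs_map_block() */
        pthread_mutex_unlock(&replace_lock);

        pthread_mutex_lock(&counter_lock);
        if (--bio_counter == 0)
                pthread_cond_signal(&counter_zero);
        pthread_mutex_unlock(&counter_lock);
}

static void replace_finishing(void)             /* the fixed ordering */
{
        /*
         * "A" first: drain in-flight IO *before* taking "B".  Taking "B"
         * first and then waiting here is the reported deadlock, because a
         * submitter that already holds "A" is stuck waiting for "B".
         */
        pthread_mutex_lock(&counter_lock);
        while (bio_counter)
                pthread_cond_wait(&counter_zero, &counter_lock);
        pthread_mutex_unlock(&counter_lock);

        pthread_mutex_lock(&replace_lock);      /* "B": flip replace state */
        pthread_mutex_unlock(&replace_lock);
}

int main(void)
{
        map_bio();
        replace_finishing();
        return 0;
}
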
> The deadlock can be reproduced easily by running replace and fsstress at
> the same time, e.g.
> 
> mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
> mount /dev/sdb1 /mnt/btrfs
> fsstress -d /mnt/btrfs -n 100 -p 2 -l 0 &  # fsstress from ltp supports -l option
> i=0
> while btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs && \
>   btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs; do
>   echo "=== loop $i ==="
>   let i=$i+1
> done
> 
> This was introduced by
> 
> c404e0d Btrfs: fix use-after-free in the finishing procedure of the device replace
> 
> Signed-off-by: Eryu Guan 
> ---
> 
> Tested by the reproducer and xfstests, no new failure found.
> 
> But I found a kmem_cache leak if I remove the btrfs module after my new test
> case[1], which does fsstress & replace & subvolume create/mount/umount/delete
> at the same time.
> 
> BUG btrfs_extent_state (Tainted: GB ): Objects remaining in btrfs_extent_state on kmem_cache_close()
> ..
> kmem_cache_destroy btrfs_extent_state: Slab cache still has objects
> CPU: 3 PID: 9503 Comm: modprobe Tainted: GB  3.17.0-rc5+ #12
> Hardware name: Hewlett-Packard ProLiant DL388eGen8, BIOS P73 06/01/2012
>   8dd09c52 880411c37eb0 81642f7a
>  8800b9a19300 880411c37ed0 8118ce89 
>  a05dcd20 880411c37ee0 a056a80f 880411c37ef0
> Call Trace:
>  [] dump_stack+0x45/0x56
>  [] kmem_cache_destroy+0xf9/0x100
>  [] extent_io_exit+0x1f/0x50 [btrfs]
>  [] exit_btrfs_fs+0x2c/0x549 [btrfs]
>  [] SyS_delete_module+0x162/0x200
>  [] ? do_notify_resume+0x97/0xb0
>  [] system_call_fastpath+0x16/0x1b
> 
> The test would hang before the fix. I'm not sure if it's related to the fix
> (seems not), please help review.
> 
> Thanks,
> Eryu Guan
> 
> [1] http://www.spinics.net/lists/linux-btrfs/msg37625.html
> 
>  fs/btrfs/dev-replace.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index eea26e1..5dfd292 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -510,6 +510,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>   /* keep away write_all_supers() during the finishing procedure */
>   mutex_lock(&root->fs_info->chunk_mutex);
>   mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> + btrfs_rm_dev_replace_blocked(fs_info);
>   btrfs_dev_replace_lock(dev_replace);
>   dev_replace->replace_state =
>   scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
> @@ -567,12 +568,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>   btrfs_kobj_rm_device(fs_info, src_device);
>   btrfs_kobj_add_device(fs_info, tgt_device);
>  
> - btrfs_rm_dev_replace_blocked(fs_info);
> -
>   btrfs_rm_dev_replace_srcdev(fs_info, src_device);
>  
> - btrfs_rm_dev_replace_unblocked(fs_info);
> -
>   /*
>* this is again a consistent state where no dev_replace procedure
>* is running, the target device is part of the filesystem, the
> @@ -581,6 +578,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>* belong to this filesystem.
>*/
>   btrfs_dev_replace_unlock(dev_replace);
> + btrfs_rm_dev_replace_unblocked(fs_info);
>   mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>   mutex_unlock(&root->fs_info->chunk_mutex);
>  
> 



Re: kernel integration branch updated

2014-09-18 Thread Miao Xie
Chris

On Fri, 19 Sep 2014 09:45:17 +0800, Qu Wenruo wrote:
> Hi Chris,
> 
> 
> I'm sorry that the commit 'btrfs: Fix and enhance merge_extent_mapping() to
> insert best fitted extent map' has a V2 patch, so the one in the tree is not
> up-to-date.
> 
> Although the v2 change is quite small and relatively self-contained, it
> should not be a painful change.

I think it is better to merge it into v3.17, since it fixes a regression in
the v3.17 kernel.

Thanks
Miao 

> Thanks,
> Qu
> 
>  Original Message 
> Subject: kernel integration branch updated
> From: Chris Mason 
> To: linux-btrfs 
> Date: 2014-09-18 22:19
>> Hi everyone,
>>
>> I've added a few more patches to the kernel integration branch, and
>> rebased onto rc5.  This should be my last rebase before sending into
>> linux-next, please take a look.
>>
>> It's still missing three patches from Josef, which we're updating.  I
>> can put more patches on top, but I'd prefer not to rebase again unless
>> some patches need removing.
>>
>> -chris



[PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6

2014-09-12 Thread Miao Xie
We need the real mirror number for RAID0/5/6 when reading data; otherwise, if
a read error happens, we would pass 0 as the number of the mirror on which the
io error happened. That is wrong and would cause the filesystem to read the
data from the corrupted mirror again.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/volumes.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1aacf5f..4856547 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5073,6 +5073,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
num_stripes = min_t(u64, map->num_stripes,
stripe_nr_end - stripe_nr_orig);
stripe_index = do_div(stripe_nr, map->num_stripes);
+   if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
+   mirror_num = 1;
} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
num_stripes = map->num_stripes;
@@ -5176,6 +5178,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
/* We distribute the parity blocks across stripes */
tmp = stripe_nr + stripe_index;
stripe_index = do_div(tmp, map->num_stripes);
+   if (!(rw & (REQ_WRITE | REQ_DISCARD |
+   REQ_GET_READ_MIRRORS)) && mirror_num <= 1)
+   mirror_num = 1;
}
} else {
/*
-- 
1.9.3



[PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io

2014-09-12 Thread Miao Xie
The current code would load the checksum data several times when we split a
whole direct read io because of the raid stripe limit, making us search the
csum tree several times. In fact, it just wasted time and made contention on
the csum tree root more serious. This patch improves this by loading the data
all at once.
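
The idea can be modeled in a few lines of plain C (the names and layout are
illustrative, not the kernel API): the checksums are looked up once for the
whole IO, and each split sub-bio just indexes into the shared array.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 4096u

struct dio_csums {
        uint64_t start;         /* logical offset of the whole direct IO */
        uint32_t *csum;         /* one csum per block, loaded only once */
};

/* one csum tree search for the whole IO instead of one per sub-bio */
static struct dio_csums *load_csums_once(uint64_t start, uint64_t len)
{
        struct dio_csums *dc = malloc(sizeof(*dc));

        if (!dc)
                return NULL;
        dc->start = start;
        dc->csum = calloc(len / BLOCKSIZE, sizeof(uint32_t));
        /* ... a real implementation would fill dc->csum here ... */
        return dc;
}

static uint32_t csum_for_subio(struct dio_csums *dc, uint64_t sub_start)
{
        return dc->csum[(sub_start - dc->start) / BLOCKSIZE];
}

int main(void)
{
        struct dio_csums *dc = load_csums_once(0, 8 * BLOCKSIZE);

        if (!dc || !dc->csum)
                return 1;
        printf("%u\n", csum_for_subio(dc, 3 * BLOCKSIZE));
        free(dc->csum);
        free(dc);
        return 0;
}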

Signed-off-by: Miao Xie 
---
Changelog v3 -> v4:
- None

Changelog v2 -> v3:
- Fix the wrong return value of btrfs_bio_clone

Changelog v1 -> v2:
- Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
  a WARNing. It was reported by Filipe David Manana, thanks.
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h   |  3 +--
 fs/btrfs/extent_io.c   | 13 +++--
 fs/btrfs/file-item.c   | 14 ++
 fs/btrfs/inode.c   | 38 +-
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fd87941..8bea70e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {
 
/* dio_bio came from fs/direct-io.c */
struct bio *dio_bio;
-   u8 csum[0];
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ded7781..7b54cd9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3719,8 +3719,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
- struct btrfs_dio_private *dip, struct bio *bio,
- u64 logical_offset);
+ struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 objectid, u64 pos,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86b39de..dfe1afe 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2621,9 +2621,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 
 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-   return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
-}
+   struct btrfs_io_bio *btrfs_bio;
+   struct bio *new;
 
+   new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+   if (new) {
+   btrfs_bio = btrfs_io_bio(new);
+   btrfs_bio->csum = NULL;
+   btrfs_bio->csum_allocated = NULL;
+   btrfs_bio->end_io = NULL;
+   }
+   return new;
+}
 
 /* this also allocates from the btrfs_bioset */
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6e6262e..783a943 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 }
 
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
- struct btrfs_dio_private *dip, struct bio *bio,
- u64 offset)
+ struct bio *bio, u64 offset)
 {
-   int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-   u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-   int ret;
-
-   len >>= inode->i_sb->s_blocksize_bits;
-   len *= csum_size;
-
-   ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
- (u32 *)(dip->csum + len), 1);
-   return ret;
+   return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2118ea6..af304e1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7196,7 +7196,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
struct inode *inode = dip->inode;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct bio *dio_bio;
-   u32 *csums = (u32 *)dip->csum;
+   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+   u32 *csums = (u32 *)io_bio->csum;
u64 start;
int i;
 
@@ -7238,6 +7239,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
if (err)
clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
dio_end_io(dio_bio, err);
+
+   if (io_bio->end_io)
+   io_bio->end_io(io_bio, err);
bio_put(bio);
 }
 
@@ -7377,13 +7381,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
if (ret)
 

[PATCH v4 08/11] Btrfs: modify clean_io_failure and make it suit direct io

2014-09-12 Thread Miao Xie
We could not use clean_io_failure in the direct IO path, because it gets the
filesystem information from the page structure, but the pages in a direct IO
bio don't carry that information. So we need to modify it and pass all the
information it needs as parameters.
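
A toy model of the change (plain C; the structs are stand-ins for the kernel
ones, and the page-cache backpointer is reduced to a single field):

#include <stdio.h>

struct inode { int id; };
struct page { struct inode *host; };    /* NULL for direct IO user pages */

/* the old shape only works when the page belongs to the page cache */
static struct inode *inode_from_page(struct page *p)
{
        return p->host;
}

/* the new shape: the caller supplies the inode, so both cases work */
static void clean_io_failure(struct inode *inode, struct page *p)
{
        (void)p;
        printf("cleaning failure record for inode %d\n", inode->id);
}

int main(void)
{
        struct inode i = { 42 };
        struct page cached = { &i }, dio = { NULL };

        clean_io_failure(inode_from_page(&cached), &cached);    /* buffered */
        clean_io_failure(&i, &dio);     /* direct IO: page has no host */
        return 0;
}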

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 31 +++
 fs/btrfs/extent_io.h |  6 +++---
 fs/btrfs/scrub.c |  3 +--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9fbc005..94c5c04 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1995,10 +1995,10 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
-   u64 length, u64 logical, struct page *page,
-   unsigned int pg_offset, int mirror_num)
+int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
+ struct page *page, unsigned int pg_offset, int mirror_num)
 {
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct bio *bio;
struct btrfs_device *dev;
u64 map_length = 0;
@@ -2046,10 +2046,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
}
 
printk_ratelimited_in_rcu(KERN_INFO
-   "BTRFS: read error corrected: ino %lu off %llu "
-   "(dev %s sector %llu)\n", page->mapping->host->i_ino,
-   start, rcu_str_deref(dev->name), sector);
-
+ "BTRFS: read error corrected: ino %llu off %llu (dev %s sector %llu)\n",
+ btrfs_ino(inode), start,
+ rcu_str_deref(dev->name), sector);
bio_put(bio);
return 0;
 }
@@ -2066,9 +2065,10 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 
for (i = 0; i < num_pages; i++) {
struct page *p = extent_buffer_page(eb, i);
-   ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-   start, p, start - page_offset(p),
-   mirror_num);
+
+   ret = repair_io_failure(root->fs_info->btree_inode, start,
+   PAGE_CACHE_SIZE, start, p,
+   start - page_offset(p), mirror_num);
if (ret)
break;
start += PAGE_CACHE_SIZE;
@@ -2081,12 +2081,12 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-static int clean_io_failure(u64 start, struct page *page)
+static int clean_io_failure(struct inode *inode, u64 start,
+   struct page *page, unsigned int pg_offset)
 {
u64 private;
u64 private_failure;
struct io_failure_record *failrec;
-   struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct extent_state *state;
int num_copies;
@@ -2126,10 +2126,9 @@ static int clean_io_failure(u64 start, struct page *page)
num_copies = btrfs_num_copies(fs_info, failrec->logical,
  failrec->len);
if (num_copies > 1)  {
-   repair_io_failure(fs_info, start, failrec->len,
+   repair_io_failure(inode, start, failrec->len,
  failrec->logical, page,
- start - page_offset(page),
- failrec->failed_mirror);
+ pg_offset, failrec->failed_mirror);
}
}
 
@@ -2538,7 +2537,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
if (ret)
uptodate = 0;
else
-   clean_io_failure(start, page);
+   clean_io_failure(inode, start, page, 0);
}
 
if (likely(uptodate))
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a82ecbc..bf0597f 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -338,9 +338,9 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask);
 
 struct btrfs_fs_info;
 
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 star

[PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing

2014-09-12 Thread Miao Xie
After the data is written successfully, we should clean up the read failure
record in that range, because:
- If we set data COW for the file, the range that the failure record points to
  is mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writing,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it, because the
  failure record will be recreated if we read the same place again.

Sometimes we may fail to correct the data, so the failure records will be left
in the tree; we need to free them when we free the inode, or we leak memory.
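
A toy model of the cleanup rule, with a plain list standing in for the extent
state tree (illustrative only; the kernel walks extent states instead):

#include <stdio.h>
#include <stdlib.h>

struct failrec {
        unsigned long long start, end;  /* inclusive byte range */
        struct failrec *next;
};

static struct failrec *records;         /* stand-in for the failure tree */

/* drop every record overlapping [start, end], e.g. after a good write */
static void free_failure_records(unsigned long long start,
                                 unsigned long long end)
{
        struct failrec **p = &records;

        while (*p) {
                struct failrec *r = *p;

                if (r->start <= end && r->end >= start) {
                        *p = r->next;   /* stale: the range was rewritten */
                        free(r);
                } else {
                        p = &r->next;
                }
        }
}

int main(void)
{
        struct failrec *r = malloc(sizeof(*r));

        if (!r)
                return 1;
        r->start = 4096;
        r->end = 8191;
        r->next = NULL;
        records = r;

        free_failure_records(0, 16383); /* covers the record, so it goes */
        printf("records empty: %d\n", records == NULL);
        return 0;
}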

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 34 ++
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c |  6 ++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86dc352..5427fd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2138,6 +2138,40 @@ out:
return 0;
 }
 
+/*
+ * Can be called when
+ * - hold extent lock
+ * - under ordered extent
+ * - the inode is freeing
+ */
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end)
+{
+   struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+   struct io_failure_record *failrec;
+   struct extent_state *state, *next;
+
+   if (RB_EMPTY_ROOT(&failure_tree->state))
+   return;
+
+   spin_lock(&failure_tree->lock);
+   state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+   while (state) {
+   if (state->start > end)
+   break;
+
+   ASSERT(state->end <= end);
+
+   next = next_state(state);
+
+   failrec = (struct io_failure_record *)state->private;
+   free_extent_state(state);
+   kfree(failrec);
+
+   state = next;
+   }
+   spin_unlock(&failure_tree->lock);
+}
+
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
struct io_failure_record **failrec_ret)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 176a4b1..5e91fb9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -366,6 +366,7 @@ struct io_failure_record {
int in_validation;
 };
 
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
struct io_failure_record **failrec_ret);
 int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bc8cdaf..c591af5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2697,6 +2697,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
goto out;
}
 
+   btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
+ordered_extent->file_offset +
+ordered_extent->len - 1);
+
if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
truncated = true;
logical_len = ordered_extent->truncated_len;
@@ -4792,6 +4796,8 @@ void btrfs_evict_inode(struct inode *inode)
/* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
+   btrfs_free_io_failure_record(inode, 0, (u64)-1);
+
if (root->fs_info->log_root_recovering) {
BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 &BTRFS_I(inode)->runtime_flags));
-- 
1.9.3



[PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions

2014-09-12 Thread Miao Xie
The data repair function of direct read will be implemented later, and some
code in bio_readpage_error will be reused, so split bio_readpage_error into
several functions that the direct read repair can also use.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 159 ++-
 fs/btrfs/extent_io.h |  28 +
 2 files changed, 123 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 154cb8e..cf1de40 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1962,25 +1962,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
SetPageUptodate(page);
 }
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-   struct page *page;
-   u64 start;
-   u64 len;
-   u64 logical;
-   unsigned long bio_flags;
-   int this_mirror;
-   int failed_mirror;
-   int in_validation;
-};
-
 static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
int ret;
@@ -2156,40 +2137,24 @@ out:
return 0;
 }
 
-/*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this as
- * needed
- */
-
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
- struct page *page, u64 start, u64 end,
- int failed_mirror)
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+   struct io_failure_record **failrec_ret)
 {
-   struct io_failure_record *failrec = NULL;
+   struct io_failure_record *failrec;
u64 private;
struct extent_map *em;
-   struct inode *inode = page->mapping->host;
struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-   struct bio *bio;
-   struct btrfs_io_bio *btrfs_failed_bio;
-   struct btrfs_io_bio *btrfs_bio;
-   int num_copies;
int ret;
-   int read_mode;
u64 logical;
 
-   BUG_ON(failed_bio->bi_rw & REQ_WRITE);
-
ret = get_state_private(failure_tree, start, &private);
if (ret) {
failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
if (!failrec)
return -ENOMEM;
+
failrec->start = start;
failrec->len = end - start + 1;
failrec->this_mirror = 0;
@@ -2209,11 +2174,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
em = NULL;
}
read_unlock(&em_tree->lock);
-
if (!em) {
kfree(failrec);
return -EIO;
}
+
logical = start - em->start;
logical = em->block_start + logical;
if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
@@ -,8 +2187,10 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
extent_set_compress_type(&failrec->bio_flags,
 em->compress_type);
}
-   pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
-"len=%llu\n", logical, start, failrec->len);
+
+   pr_debug("Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n",
+logical, start, failrec->len);
+
failrec->logical = logical;
free_extent_map(em);
 
@@ -2243,8 +2210,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
}
} else {
failrec = (struct io_failure_record *)(unsigned long)private;
-   pr_debug("bio_readpage_error: (found) logical=%llu, "
-"start=%llu, len=%llu, validation=%d\n",
+   pr_debug("Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n",
 failrec->logical, failr

[PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data check and dio read data check

2014-09-12 Thread Miao Xie
Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/inode.c | 102 +--
 1 file changed, 47 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index af304e1..e8139c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2893,6 +2893,40 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
return 0;
 }
 
+static int __readpage_endio_check(struct inode *inode,
+ struct btrfs_io_bio *io_bio,
+ int icsum, struct page *page,
+ int pgoff, u64 start, size_t len)
+{
+   char *kaddr;
+   u32 csum_expected;
+   u32 csum = ~(u32)0;
+   static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   csum_expected = *(((u32 *)io_bio->csum) + icsum);
+
+   kaddr = kmap_atomic(page);
+   csum = btrfs_csum_data(kaddr + pgoff, csum,  len);
+   btrfs_csum_final(csum, (char *)&csum);
+   if (csum != csum_expected)
+   goto zeroit;
+
+   kunmap_atomic(kaddr);
+   return 0;
+zeroit:
+   if (__ratelimit(&_rs))
+   btrfs_info(BTRFS_I(inode)->root->fs_info,
+  "csum failed ino %llu off %llu csum %u expected csum %u",
+  btrfs_ino(inode), start, csum, csum_expected);
+   memset(kaddr + pgoff, 1, len);
+   flush_dcache_page(page);
+   kunmap_atomic(kaddr);
+   if (csum_expected == 0)
+   return 0;
+   return -EIO;
+}
+
 /*
  * when reads are done, we need to check csums to verify the data is correct
  * if there's a match, we allow the bio to finish.  If not, the code in
@@ -2905,20 +2939,15 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
size_t offset = start - page_offset(page);
struct inode *inode = page->mapping->host;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-   char *kaddr;
struct btrfs_root *root = BTRFS_I(inode)->root;
-   u32 csum_expected;
-   u32 csum = ~(u32)0;
-   static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
 
if (PageChecked(page)) {
ClearPageChecked(page);
-   goto good;
+   return 0;
}
 
if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-   goto good;
+   return 0;
 
if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
@@ -2928,28 +2957,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
}
 
phy_offset >>= inode->i_sb->s_blocksize_bits;
-   csum_expected = *(((u32 *)io_bio->csum) + phy_offset);
-
-   kaddr = kmap_atomic(page);
-   csum = btrfs_csum_data(kaddr + offset, csum,  end - start + 1);
-   btrfs_csum_final(csum, (char *)&csum);
-   if (csum != csum_expected)
-   goto zeroit;
-
-   kunmap_atomic(kaddr);
-good:
-   return 0;
-
-zeroit:
-   if (__ratelimit(&_rs))
-   btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-   btrfs_ino(page->mapping->host), start, csum, csum_expected);
-   memset(kaddr + offset, 1, end - start + 1);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr);
-   if (csum_expected == 0)
-   return 0;
-   return -EIO;
+   return __readpage_endio_check(inode, io_bio, phy_offset, page, offset,
+ start, (size_t)(end - start + 1));
 }
 
 struct delayed_iput {
@@ -7194,41 +7203,24 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
struct btrfs_dio_private *dip = bio->bi_private;
struct bio_vec *bvec;
struct inode *inode = dip->inode;
-   struct btrfs_root *root = BTRFS_I(inode)->root;
struct bio *dio_bio;
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-   u32 *csums = (u32 *)io_bio->csum;
u64 start;
+   int ret;
int i;
 
+   if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
+   goto skip_checksum;
+
start = dip->logical_offset;
bio_for_each_segment_all(bvec, bio, i) {
-   if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-   struct page *page = bvec->bv_page;
-   char *kaddr;
-   u32 csum = ~(u32)0;
-   unsigned long flags;
-
-   local_irq_save(flags);
-   kaddr = kmap_atomic(p

[PATCH v4 10/11] Btrfs: implement repair function when direct read fails

2014-09-12 Thread Miao Xie
This patch implements the data repair function for when a direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not the common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue; if we
  inserted the endio work of the io on the mirror into that workqueue, deadlock
  would happen (see the toy queue model after this list).
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
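
Why the dedicated queue matters can be shown with a tiny ordered-queue model
(plain C; this is not the kernel workqueue API, only an illustration of the
ordering argument):

#include <stdbool.h>
#include <stdio.h>

struct work {
        const char *name;
        int waits_for;  /* index of a work item it blocks on, -1 = none */
};

/* one queue whose items run strictly in order, like a saturated worker */
static bool run_queue(struct work *q, int n)
{
        bool done[16] = { false };

        for (int i = 0; i < n; i++) {
                int dep = q[i].waits_for;

                if (dep >= 0 && !done[dep]) {
                        printf("deadlock: %s waits for %s behind it\n",
                               q[i].name, q[dep].name);
                        return false;
                }
                done[i] = true;
        }
        return true;
}

int main(void)
{
        /* same queue: the original endio blocks on the repair endio */
        struct work same_queue[] = {
                { "orig-endio", 1 },
                { "repair-endio", -1 },
        };
        run_queue(same_queue, 2);

        /* dedicated queue: the repair endio runs on its own and finishes,
         * and only then does the original endio need to complete */
        struct work repair_q[] = { { "repair-endio", -1 } };
        struct work endio_q[] = { { "orig-endio", -1 } };
        return !(run_queue(repair_q, 1) && run_queue(endio_q, 1));
}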

Signed-off-by: Miao Xie 
---
Changelog v3 -> v4:
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this suggestion was from Chris.

Changelog v1 -> v3:
- None
---
 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |   2 +-
 fs/btrfs/ctree.h|   1 +
 fs/btrfs/disk-io.c  |  11 +-
 fs/btrfs/disk-io.h  |   1 +
 fs/btrfs/extent_io.c|  12 ++-
 fs/btrfs/extent_io.h|   5 +-
 fs/btrfs/inode.c| 276 
 9 files changed, 281 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index fbd76de..2da0a66 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -74,6 +74,7 @@ BTRFS_WORK_HELPER(endio_helper);
 BTRFS_WORK_HELPER(endio_meta_helper);
 BTRFS_WORK_HELPER(endio_meta_write_helper);
 BTRFS_WORK_HELPER(endio_raid56_helper);
+BTRFS_WORK_HELPER(endio_repair_helper);
 BTRFS_WORK_HELPER(rmw_helper);
 BTRFS_WORK_HELPER(endio_write_helper);
 BTRFS_WORK_HELPER(freespace_write_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index e9e31c9..e386c29 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -53,6 +53,7 @@ BTRFS_WORK_HELPER_PROTO(endio_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper);
 BTRFS_WORK_HELPER_PROTO(endio_raid56_helper);
+BTRFS_WORK_HELPER_PROTO(endio_repair_helper);
 BTRFS_WORK_HELPER_PROTO(rmw_helper);
 BTRFS_WORK_HELPER_PROTO(endio_write_helper);
 BTRFS_WORK_HELPER_PROTO(freespace_write_helper);
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 4d30947..7a7521c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -271,7 +271,7 @@ struct btrfs_dio_private {
 * The original bio may be splited to several sub-bios, this is
 * done during endio of sub-bios
 */
-   int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
+   int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7b54cd9..63acfd8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1538,6 +1538,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *endio_workers;
struct btrfs_workqueue *endio_meta_workers;
struct btrfs_workqueue *endio_raid56_workers;
+   struct btrfs_workqueue *endio_repair_workers;
struct btrfs_workqueue *rmw_workers;
struct btrfs_workqueue *endio_meta_write_workers;
struct btrfs_workqueue *endio_write_workers;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ff3ee22..1594d91 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -713,7 +713,11 @@ static void end_workqueue_bio(struct bio *bio, int err)
func = btrfs_endio_write_helper;
}
} else {
-   if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
+   if (unlikely(end_io_wq->metadata ==
+BTRFS_WQ_ENDIO_DIO_REPAIR)) {
+   wq = fs_info->endio_repair_workers;
+   func = btrfs_endio_repair_helper;
+   } else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
wq = fs_info->endio_raid56_workers;
func = btrfs_endio_raid56_helper;
} else if (end_io_wq->metadata) {
@@ -741,6 +745,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
int metadata)
 {
struct end_io_wq *end_io_wq;
+
end_io_wq = kmalloc(sizeof(*end_io_wq), GFP_NOFS);
if (!end_io_wq)
return -ENOMEM;
@@ -2059,6 +2064,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
btrfs_destroy_workqueue(fs_info->endio_workers);
btrfs_destroy_workqueue(fs_info->endio_meta_workers);
btrfs_destroy_workqueue(fs_info->endio_raid56_workers);
+   btrfs_destroy_workqueue(fs_info->endio_repair_workers);
btrfs_destroy_workqueue(fs_info->rmw_workers);

[PATCH v4 03/11] Btrfs: do file data check by sub-bio's self

2014-09-12 Thread Miao Xie
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.

But it was very hard to implement the data repair when a dio read failure
happens, because at the final end io function we didn't know which mirror the
data was read from. So in order to implement the data repair, we have to move
the file data check from the final end io function to the sub-bio end io
function, in which we can get the mirror number of the device we accessed.
This patch does this work as the first step of the direct io data repair
implementation.
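
A hedged user-space model of that move (struct names are stand-ins, not the
kernel's):

#include <stdio.h>

struct sub_bio {
        int mirror_num;         /* which copy this sub-bio actually read */
        int csum_ok;            /* result of verifying this sub-bio's data */
};

/* per-sub-bio endio: mirror_num is still known here, so repair can use it */
static int subio_endio(const struct sub_bio *s)
{
        if (!s->csum_ok) {
                printf("csum failed on mirror %d, repair from another copy\n",
                       s->mirror_num);
                return -5;      /* -EIO */
        }
        return 0;
}

int main(void)
{
        struct sub_bio subs[] = { { 1, 1 }, { 2, 0 } };
        int err = 0;

        /*
         * The final endio only aggregates results; by the time it runs,
         * the mirror number of the failing read would already be lost.
         */
        for (int i = 0; i < 2; i++)
                if (subio_endio(&subs[i]))
                        err = -5;
        printf("aggregate err: %d\n", err);
        return 0;
}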

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/btrfs_inode.h |   9 +
 fs/btrfs/extent_io.c   |   2 +-
 fs/btrfs/inode.c   | 100 -
 fs/btrfs/volumes.h |   5 ++-
 4 files changed, 87 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 8bea70e..4d30947 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation)
return 0;
 }
 
+#define BTRFS_DIO_ORIG_BIO_SUBMITTED   0x1
+
 struct btrfs_dio_private {
struct inode *inode;
+   unsigned long flags;
u64 logical_offset;
u64 disk_bytenr;
u64 bytes;
@@ -263,6 +266,12 @@ struct btrfs_dio_private {
 
/* dio_bio came from fs/direct-io.c */
struct bio *dio_bio;
+
+   /*
+* The original bio may be splited to several sub-bios, this is
+* done during endio of sub-bios
+*/
+   int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
 };
 
 /*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfe1afe..92a6d9f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2472,7 +2472,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
struct inode *inode = page->mapping->host;
 
pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
-"mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err,
+"mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
 io_bio->mirror_num);
tree = &BTRFS_I(inode)->io_tree;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e8139c6..cf79f79 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7198,29 +7198,40 @@ unlock_err:
return ret;
 }
 
-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static int btrfs_subio_endio_read(struct inode *inode,
+ struct btrfs_io_bio *io_bio)
 {
-   struct btrfs_dio_private *dip = bio->bi_private;
struct bio_vec *bvec;
-   struct inode *inode = dip->inode;
-   struct bio *dio_bio;
-   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
u64 start;
-   int ret;
int i;
+   int ret;
+   int err = 0;
 
-   if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-   goto skip_checksum;
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+   return 0;
 
-   start = dip->logical_offset;
-   bio_for_each_segment_all(bvec, bio, i) {
+   start = io_bio->logical;
+   bio_for_each_segment_all(bvec, &io_bio->bio, i) {
ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 0, start, bvec->bv_len);
if (ret)
err = -EIO;
start += bvec->bv_len;
}
-skip_checksum:
+
+   return err;
+}
+
+static void btrfs_endio_direct_read(struct bio *bio, int err)
+{
+   struct btrfs_dio_private *dip = bio->bi_private;
+   struct inode *inode = dip->inode;
+   struct bio *dio_bio;
+   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+
+   if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
+   err = btrfs_subio_endio_read(inode, io_bio);
+
unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
  dip->logical_offset + dip->bytes - 1);
dio_bio = dip->dio_bio;
@@ -7298,6 +7309,7 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
struct btrfs_dio_private *dip = bio->bi_private;
+   int ret;
 
if (err) {
btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
@@ -7305,6 +7317,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
  btrfs_ino(dip->inode), bio->bi_rw,
  (unsigned long long)bio->bi_iter.bi_sector,
  bio->bi_iter.bi_size, err);
+ 

[PATCH v4 05/11] Btrfs: Cleanup unused variable and argument of IO failure handlers

2014-09-12 Thread Miao Xie
Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f8dda46..154cb8e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1981,8 +1981,7 @@ struct io_failure_record {
int in_validation;
 };
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
-   int did_repair)
+static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
int ret;
int err = 0;
@@ -2109,7 +2108,6 @@ static int clean_io_failure(u64 start, struct page *page)
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct extent_state *state;
int num_copies;
-   int did_repair = 0;
int ret;
 
private = 0;
@@ -2130,7 +2128,6 @@ static int clean_io_failure(u64 start, struct page *page)
/* there was no real error, just free the record */
pr_debug("clean_io_failure: freeing dummy error at %llu\n",
 failrec->start);
-   did_repair = 1;
goto out;
}
if (fs_info->sb->s_flags & MS_RDONLY)
@@ -2147,19 +2144,16 @@ static int clean_io_failure(u64 start, struct page *page)
num_copies = btrfs_num_copies(fs_info, failrec->logical,
  failrec->len);
if (num_copies > 1)  {
-   ret = repair_io_failure(fs_info, start, failrec->len,
-   failrec->logical, page,
-   failrec->failed_mirror);
-   did_repair = !ret;
+   repair_io_failure(fs_info, start, failrec->len,
+ failrec->logical, page,
+ failrec->failed_mirror);
}
-   ret = 0;
}
 
 out:
-   if (!ret)
-   ret = free_io_failure(inode, failrec, did_repair);
+   free_io_failure(inode, failrec);
 
-   return ret;
+   return 0;
 }
 
 /*
@@ -2269,7 +2263,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 */
 pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 num_copies, failrec->this_mirror, failed_mirror);
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
 
@@ -2312,13 +2306,13 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
if (failrec->this_mirror > num_copies) {
 pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 num_copies, failrec->this_mirror, failed_mirror);
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
 
bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
if (!bio) {
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
bio->bi_end_io = failed_bio->bi_end_io;
@@ -2349,7 +2343,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 failrec->this_mirror,
 failrec->bio_flags, 0);
if (ret) {
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
bio_put(bio);
}
 
-- 
1.9.3



[PATCH v4 04/11] Btrfs: fix missing error handler if submitting re-read bio fails

2014-09-12 Thread Miao Xie
We forgot to free the failure record and the bio after submitting the re-read
bio failed; fix it.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 92a6d9f..f8dda46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 failrec->this_mirror,
 failrec->bio_flags, 0);
+   if (ret) {
+   free_io_failure(inode, failrec, 0);
+   bio_put(bio);
+   }
+
return ret;
 }
 
-- 
1.9.3



[PATCH v4 00/11] Implement the data repair function for direct read

2014-09-12 Thread Miao Xie
This patchset implements the data repair function for the direct read; it
is implemented like the buffered read:
1.When we find the data is not right, we try to read the data from the other
  mirror.
2.When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not the common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue; if we
  inserted the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
3.If we get right data, we write it back to repair the corrupted mirror.
4.If the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
5.After the above work, we set the uptodate flag according to the result.

The difference is that the direct read may be split into several small ios;
in order to get the number of the mirror on which the io error happens, we
have to do the data check and repair in the end IO function of those sub-IO
requests.

Besides that, we also fixed some bugs of direct io.
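
As a hedged illustration of steps 1-5 above, here is a compact user-space
model of the retry loop (read_mirror, verify and writeback are toy stand-ins,
not kernel functions):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NUM_MIRRORS 2

/* toy stand-ins: mirror 1 is corrupted, mirror 2 holds good data */
static bool read_mirror(int mirror, char *buf)
{
        strcpy(buf, mirror == 2 ? "good" : "bad!");
        return true;
}

static bool verify(const char *buf)     /* checksum check stand-in */
{
        return strcmp(buf, "good") == 0;
}

static void writeback(int mirror, const char *buf)
{
        printf("repairing mirror %d with \"%s\"\n", mirror, buf);
}

/* steps 1-4: read the other mirrors until one verifies, then repair */
static bool repair_read(int failed_mirror, char *buf)
{
        for (int m = 1; m <= NUM_MIRRORS; m++) {
                if (m == failed_mirror)
                        continue;       /* skip the known-bad copy */
                if (read_mirror(m, buf) && verify(buf)) {
                        writeback(failed_mirror, buf);
                        return true;    /* step 5: mark uptodate */
                }
        }
        return false;                   /* every mirror was corrupted */
}

int main(void)
{
        char buf[8];

        return repair_read(1, buf) ? 0 : 1;
}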

Changelog v3 -> v4:
- Remove the 1st patch which has been applied into the upstream kernel.
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this suggestion was from Chris.
- Rebase the patchset to integration branch of Chris's git tree.

Changelog v2 -> v3:
- Fix wrong returned bio when doing bio clone, which was reported by Filipe

Changelog v1 -> v2:
- Fix the warning which was triggered by __GFP_ZERO in the 2nd patch

Miao Xie (11):
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data check and dio
read data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submitting re-read bio fails
  Btrfs: Cleanup unused variable and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode
is freeing

 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |  10 +-
 fs/btrfs/ctree.h|   4 +-
 fs/btrfs/disk-io.c  |  11 +-
 fs/btrfs/disk-io.h  |   1 +
 fs/btrfs/extent_io.c| 254 +--
 fs/btrfs/extent_io.h|  38 -
 fs/btrfs/file-item.c|  14 +-
 fs/btrfs/inode.c| 446 +++-
 fs/btrfs/scrub.c|   4 +-
 fs/btrfs/volumes.c  |   5 +
 fs/btrfs/volumes.h  |   5 +-
 13 files changed, 601 insertions(+), 193 deletions(-)

-- 
1.9.3



[PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io

2014-09-12 Thread Miao Xie
The original code of repair_io_failure was only used for buffered reads,
because it got some filesystem data from the page structure, which is safe for
pages in the page cache. But when we do a direct read, the pages in the bio
are not in the page cache, that is, there is no filesystem data in the page
structure. In order to implement direct read data repair, we need to modify
repair_io_failure and pass all the filesystem data it needs as function
parameters.

Signed-off-by: Miao Xie 
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 8 +---
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/scrub.c | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cf1de40..9fbc005 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1997,7 +1997,7 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  */
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
u64 length, u64 logical, struct page *page,
-   int mirror_num)
+   unsigned int pg_offset, int mirror_num)
 {
struct bio *bio;
struct btrfs_device *dev;
@@ -2036,7 +2036,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
return -EIO;
}
bio->bi_bdev = dev->bdev;
-   bio_add_page(bio, page, length, start - page_offset(page));
+   bio_add_page(bio, page, length, pg_offset);
 
if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) {
/* try to remap that extent elsewhere? */
@@ -2067,7 +2067,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
for (i = 0; i < num_pages; i++) {
struct page *p = extent_buffer_page(eb, i);
ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-   start, p, mirror_num);
+   start, p, start - page_offset(p),
+   mirror_num);
if (ret)
break;
start += PAGE_CACHE_SIZE;
@@ -2127,6 +2128,7 @@ static int clean_io_failure(u64 start, struct page *page)
if (num_copies > 1)  {
repair_io_failure(fs_info, start, failrec->len,
  failrec->logical, page,
+ start - page_offset(page),
  failrec->failed_mirror);
}
}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 75b621b..a82ecbc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -340,7 +340,7 @@ struct btrfs_fs_info;
 
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
u64 length, u64 logical, struct page *page,
-   int mirror_num);
+   unsigned int pg_offset, int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cce122b..3978529 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -682,6 +682,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
fs_info = BTRFS_I(inode)->root->fs_info;
ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
fixup->logical, page,
+   offset - page_offset(page),
fixup->mirror_num);
unlock_page(page);
corrected = !ret;
-- 
1.9.3



Re: [PATCH 5/5] Btrfs: scan all the devices and build the fs device list by btrfs's self

2014-09-08 Thread Miao Xie
On Sat, 6 Sep 2014 13:48:09 +0200, Goffredo Baroncelli wrote:
> On 09/03/2014 03:36 PM, Miao Xie wrote:
>> The original code needs the devices to be scanned and the fs device list to
>> be built by a user tool, via udev or by the users themselves. It is flexible.
>> But if someone re-installs the filesystem module and forgets to scan the
>> devices by himself, or we plug in some devices with btrfs but the udev thread
>> is blocked and doesn't register the disks into btrfs in time, the filesystem
>> would report "can not open some device" when mounting, which was
>> uncomfortable. This patch fixes this problem by scanning all the devices if
>> we find that the number of devices is not right when we mount the filesystem.
>>
>> Signed-off-by: Miao Xie 
> []
>> +
>> +void btrfs_scan_all_devices(void *holder)
>> +{
>> +struct class_dev_iter iter;
>> +struct device *dev;
>> +struct gendisk *disk;
>> +
>> +mutex_lock(&uuid_mutex);
>> +class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
>> +while ((dev = class_dev_iter_next(&iter))) {
>> +disk = dev_to_disk(dev);
>> +
>> +if (!get_capacity(disk) ||
>> +(!disk_max_parts(disk) &&
>> + (disk->flags & GENHD_FL_REMOVABLE)))
> ^^
>> +continue;
>> +
>> +if (disk->flags & GENHD_FL_SUPPRESS_PARTITION_INFO)
>> +continue;
> 
> 
> Hi, could you elaborate on why a removable disk should not be scanned? How
> is a removable usb disk classified?

This is used to filter out non-partitionable removable devices such as cdroms;
if it is a usb disk, it should be partitionable and will still be scanned.

Thanks
Miao


[PATCH 18/18] Btrfs: modify rw_devices counter under chunk_mutex context

2014-09-03 Thread Miao Xie
The rw_devices counter is often used to tune the profile when doing chunk
allocation, so we should modify it under the chunk_mutex context to avoid
getting a wrong chunk profile.
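
A minimal user-space model of the race, with pthreads standing in for the
kernel locking (the names are illustrative):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t chunk_mutex = PTHREAD_MUTEX_INITIALIZER;
static int rw_devices = 2;

static void remove_rw_device(void)
{
        pthread_mutex_lock(&chunk_mutex);
        /* list_del_init(&device->dev_alloc_list); */
        rw_devices--;   /* changed under chunk_mutex ... */
        pthread_mutex_unlock(&chunk_mutex);
        /*
         * ... not out here: an allocator already holding chunk_mutex
         * could otherwise still read the old value and pick a profile
         * that needs more writable devices than actually remain.
         */
}

static const char *pick_profile(void)
{
        const char *profile;

        pthread_mutex_lock(&chunk_mutex);
        profile = rw_devices >= 2 ? "raid1" : "single";
        pthread_mutex_unlock(&chunk_mutex);
        return profile;
}

int main(void)
{
        remove_rw_device();
        printf("profile: %s\n", pick_profile());        /* single */
        return 0;
}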

Signed-off-by: Miao Xie 
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b7f093d..1aacf5f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1649,8 +1649,8 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path)
if (device->writeable) {
lock_chunks(root);
list_del_init(&device->dev_alloc_list);
+   device->fs_devices->rw_devices--;
unlock_chunks(root);
-   root->fs_info->fs_devices->rw_devices--;
clear_super = true;
}
 
@@ -1795,8 +1795,8 @@ error_undo:
lock_chunks(root);
list_add(&device->dev_alloc_list,
 &root->fs_info->fs_devices->alloc_list);
+   device->fs_devices->rw_devices++;
unlock_chunks(root);
-   root->fs_info->fs_devices->rw_devices++;
}
goto error_brelse;
 }
-- 
1.9.3



[PATCH 13/18] Btrfs: fix unprotected device list access when cloning fs devices

2014-09-03 Thread Miao Xie
We can build a new filesystem based on a seed filesystem, and we need to clone
the fs devices when we open the new filesystem. But someone might clear the
seed flag of the seed filesystem, then mount that filesystem and remove some
device. If we then mount the new filesystem, we might access a device list
that is being changed while we clone the fs devices. Fix it.

Signed-off-by: Miao Xie 
---
 fs/btrfs/volumes.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 357f911..f0173b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -583,6 +583,7 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
if (IS_ERR(fs_devices))
return fs_devices;
 
+   mutex_lock(&orig->device_list_mutex);
fs_devices->total_devices = orig->total_devices;
 
/* We have held the volume lock, it is safe to get the devices. */
@@ -611,8 +612,10 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
device->fs_devices = fs_devices;
fs_devices->num_devices++;
}
+   mutex_unlock(&orig->device_list_mutex);
return fs_devices;
 error:
+   mutex_unlock(&orig->device_list_mutex);
free_fs_devices(fs_devices);
return ERR_PTR(-ENOMEM);
 }
-- 
1.9.3



[PATCH 1/5] block: export disk_class and disk_type for btrfs

2014-09-03 Thread Miao Xie
Btrfs can make a filesystem that spans several disks/partitions. In order to
load all the disks/partitions that belong to the same filesystem, we need to
scan the system, find all the devices, and then register them in the kernel.
Currently we do it with a user tool, but if we forget to do that, we can not
mount the filesystem. So I want btrfs to scan the system and find all the
devices by itself in the kernel. In order to implement this, we need
disk_class and disk_type, so export them.

Signed-off-by: Miao Xie 
---
 block/genhd.c | 7 +--
 include/linux/genhd.h | 1 +
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 791f419..8371c09 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -34,7 +34,7 @@ struct kobject *block_depr;
 static DEFINE_MUTEX(ext_devt_mutex);
 static DEFINE_IDR(ext_devt_idr);
 
-static struct device_type disk_type;
+struct device_type disk_type;
 
 static void disk_check_events(struct disk_events *ev,
  unsigned int *clearing_ptr);
@@ -1107,9 +1107,11 @@ static void disk_release(struct device *dev)
blk_put_queue(disk->queue);
kfree(disk);
 }
+
 struct class block_class = {
.name   = "block",
 };
+EXPORT_SYMBOL(block_class);
 
 static char *block_devnode(struct device *dev, umode_t *mode,
   kuid_t *uid, kgid_t *gid)
@@ -1121,12 +1123,13 @@ static char *block_devnode(struct device *dev, umode_t *mode,
return NULL;
 }
 
-static struct device_type disk_type = {
+struct device_type disk_type = {
.name   = "disk",
.groups = disk_attr_groups,
.release= disk_release,
.devnode= block_devnode,
 };
+EXPORT_SYMBOL(disk_type);
 
 #ifdef CONFIG_PROC_FS
 /*
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index ec274e0..a701ace 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -22,6 +22,7 @@
 #define part_to_dev(part)  (&((part)->__dev))
 
 extern struct device_type part_type;
+extern struct device_type disk_type;
 extern struct kobject *block_depr;
 extern struct class block_class;
 
-- 
1.9.3



[PATCH 15/18] Btrfs: make the logic of source device removing more clear

2014-09-03 Thread Miao Xie
Signed-off-by: Miao Xie 
---
 fs/btrfs/dev-replace.c |  3 +--
 fs/btrfs/volumes.c | 19 +++
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e9cbbdb..6f662b3 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -569,8 +569,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
if (fs_info->fs_devices->latest_bdev == src_device->bdev)
fs_info->fs_devices->latest_bdev = tgt_device->bdev;
list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
-   if (src_device->fs_devices->seeding)
-   fs_info->fs_devices->rw_devices++;
+   fs_info->fs_devices->rw_devices++;
 
/* replace the sysfs entry */
btrfs_kobj_rm_device(fs_info, src_device);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 24d7001..fd8141e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1819,23 +1819,18 @@ void btrfs_rm_dev_replace_srcdev(struct btrfs_fs_info *fs_info,
list_del_rcu(&srcdev->dev_list);
list_del_rcu(&srcdev->dev_alloc_list);
fs_devices->num_devices--;
-   if (srcdev->missing) {
+   if (srcdev->missing)
fs_devices->missing_devices--;
-   if (!fs_devices->seeding)
-   fs_devices->rw_devices++;
+
+   if (srcdev->writeable) {
+   fs_devices->rw_devices--;
+   /* zero out the old super if it is writable */
+   btrfs_scratch_superblock(srcdev);
}
 
-   if (srcdev->bdev) {
+   if (srcdev->bdev)
fs_devices->open_devices--;
 
-   /*
-* zero out the old super if it is not writable
-* (e.g. seed device)
-*/
-   if (srcdev->writeable)
-   btrfs_scratch_superblock(srcdev);
-   }
-
call_rcu(&srcdev->rcu, free_device);
 
/*
-- 
1.9.3


