Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-19 Thread Jan Kara
Hi!

On Fri 19-01-18 15:03:30, piaojun wrote:
> On 2018/1/19 13:42, Changwei Ge wrote:
> > Hi Jun,
> > 
> > On 2018/1/19 11:59, piaojun wrote:
> >> Hi Changwei,
> >>
> >> On 2018/1/19 11:38, Changwei Ge wrote:
> >>> Hi Jun,
> >>>
> >>> On 2018/1/19 11:17, piaojun wrote:
>  Hi Jan, Eric and Changwei,
> 
>  Could we use another mutex lock to protect quota recovery? Sharing the
>  lock with VFS-layer probably seems a little weird.
> >>>
> >>> I am afraid that we can't since quota need ::s_umount and we indeed need
> >>> ::s_umount to get rid of race that quota has freed structs that will be
> >>> used by quota recovery in ocfs2.
> >>>
> >> Could you explain which 'structs' used by quota recovery? Do you mean
> >> 'struct super_block'?
> > I am not pointing to super_block.
> > 
> > Sure.
> > You can refer to
> > ocfs2_finish_quota_recovery
> >ocfs2_recover_local_quota_file -> here, operations on quota happens
> > 
> > Thanks,
> > Changwei
> > 
> I looked through the code in deactivate_locked_super(), and did not
> find any *structs* needed to be protected except 'sb'. In addition I

There are quota structures (generally attached to superblock in some way)
that get freed in ocfs2_local_free_info() that are used by quota recovery.
ocfs2_local_free_info() gets called when we call dquot_disable() from
ocfs2_disable_quotas() from ocfs2_dismount_volume(). That's why we need to
serialize quota recovery and umount.
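
Schematically, the two paths in play look roughly like this (a condensed map
pieced together from the stack traces in this thread and the call chain above,
not the actual source):

  umount side:
    deactivate_locked_super()           <- holds sb->s_umount
      ocfs2_put_super()
        ocfs2_dismount_volume()
          ocfs2_disable_quotas()
            dquot_disable()
              ocfs2_local_free_info()   <- frees the local quota structures
          ocfs2_truncate_log_shutdown()
            flush_workqueue()           <- waits for the recovery work below

  recovery work (ocfs2 workqueue):
    ocfs2_complete_recovery()
      ocfs2_finish_quota_recovery()     <- down_read(&sb->s_umount)
        ocfs2_recover_local_quota_file()   <- uses the structures freed above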

> still could not figure out why 'dqonoff_mutex' was replaced by
> 's_umount'. At least, the patch 5f530de63cfc did not mention that. I
> would appreciate a detailed explanation.

It was replaced mostly to fix a lock inversion issue in generic code -
details are described in commit 9d1ccbe70e0b145 "quota: Use s_umount
protection for quota operations".

Looking into ocfs2 quota code, I think it will be easiest to introduce
ocfs2 variant of dqonoff mutex and use it for serialization of umount and
recovery. While reading the recovery / umount / remount code, I've also
found several other problems / races that also need fixing (e.g. if some
node mounts a filesystem in read-only mode and ends up doing recovery of
some crashed node, quota recovery will not work). So I plan to do some more
fixing in this area, just didn't get to it yet.
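
Just to sketch the direction (nothing here is final, and the names below are
invented):

	/* hypothetical: a mutex living in struct ocfs2_super, mutex_init()ed
	 * at mount; real code also has to check whether quotas are still
	 * enabled, handle each quota type, error paths, etc. */
	struct mutex osb_quota_recovery_lock;

	/* recovery side: serialize on the ocfs2-local mutex instead of
	 * taking sb->s_umount */
	mutex_lock(&osb_quota_recovery_lock);
	/* replay the local quota files, as today */
	mutex_unlock(&osb_quota_recovery_lock);

	/* umount side: hold the same mutex across dquot_disable() so that
	 * ocfs2_local_free_info() cannot free structures under recovery */
	mutex_lock(&osb_quota_recovery_lock);
	/* dquot_disable() for each enabled quota type */
	mutex_unlock(&osb_quota_recovery_lock);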

Honza

>  On 2018/1/19 9:48, Changwei Ge wrote:
> > Hi Jan,
> >
> > On 2018/1/18 0:03, Jan Kara wrote:
> >> On Wed 17-01-18 16:21:35, Jan Kara wrote:
> >>> Hello,
> >>>
> >>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
>  On 01/12/2018 11:43 AM, Shichangkuo wrote:
> > Hi all,
> >   Now we are testing ocfs2 with 4.14 kernel, and we finding a 
> > deadlock with umount and ocfs2 workqueue triggered by ocfs2rec 
> > thread. The stack as follows:
> > journal recovery work:
> > [] call_rwsem_down_read_failed+0x14/0x30
> > [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> > [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> > [] process_one_work+0x130/0x350
> > [] worker_thread+0x46/0x3b0
> > [] kthread+0x101/0x140
> > [] ret_from_fork+0x1f/0x30
> > [] 0x
> >
> > /bin/umount:
> > [] flush_workqueue+0x104/0x3e0
> > [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> > [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> > [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> > [] generic_shutdown_super+0x6d/0x120
> > [] kill_block_super+0x2d/0x60
> > [] deactivate_locked_super+0x51/0x90
> > [] cleanup_mnt+0x3b/0x70
> > [] task_work_run+0x86/0xa0
> > [] exit_to_usermode_loop+0x6d/0xa9
> > [] do_syscall_64+0x11d/0x130
> > [] entry_SYSCALL64_slow_path+0x25/0x25
> > [] 0x
> >   
> > Function ocfs2_finish_quota_recovery try to get sb->s_umount, which 
> > was already locked by umount thread, then get a deadlock.
> 
>  Good catch, thanks for reporting.  Is it reproducible? Can you 
>  please share
>  the steps for reproducing this issue?
> > This issue was introduced by 
> > c3b004460d77bf3f980d877be539016f2df4df12 and 
> > 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> > I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex 
> > was already removed.
> > Shall we add a new mutex?
> 
>  @Jan, I don't look into the code yet, could you help me understand 
>  why we
>  need to get sb->s_umount in ocfs2_finish_quota_recovery?
>  Is it because that the quota recovery process will start at 
>  umounting? or
>  some where else?
> >>>
> >>> I was refreshing my memory wrt how ocfs2 quota recovery works. The 
> >>> problem
> >>> is the following: We load information about all

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread piaojun
Hi Changwei,

On 2018/1/19 13:42, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/19 11:59, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/19 11:38, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/1/19 11:17, piaojun wrote:
 Hi Jan, Eric and Changwei,

 Could we use another mutex lock to protect quota recovery? Sharing the
 lock with VFS-layer probably seems a little weird.
>>>
>>> I am afraid that we can't since quota need ::s_umount and we indeed need
>>> ::s_umount to get rid of race that quota has freed structs that will be
>>> used by quota recovery in ocfs2.
>>>
>> Could you explain which 'structs' used by quota recovery? Do you mean
>> 'struct super_block'?
> I am not pointing to super_block.
> 
> Sure.
> You can refer to
> ocfs2_finish_quota_recovery
>ocfs2_recover_local_quota_file -> here, operations on quota happens
> 
> Thanks,
> Changwei
> 
I looked through the code in deactivate_locked_super(), and did not
find any *structs* that need to be protected except 'sb'. In addition, I
still could not figure out why 'dqonoff_mutex' was replaced by
's_umount'. At least, the patch 5f530de63cfc did not mention that. I
would appreciate a detailed explanation.

""
ocfs2: Use s_umount for quota recovery protection

Currently we use dqonoff_mutex to serialize quota recovery protection
and turning of quotas on / off. Use s_umount semaphore instead.
""

thanks,
Jun

>>
>> thanks,
>> Jun
>>

 On 2018/1/19 9:48, Changwei Ge wrote:
> Hi Jan,
>
> On 2018/1/18 0:03, Jan Kara wrote:
>> On Wed 17-01-18 16:21:35, Jan Kara wrote:
>>> Hello,
>>>
>>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
 On 01/12/2018 11:43 AM, Shichangkuo wrote:
> Hi all,
>   Now we are testing ocfs2 with 4.14 kernel, and we finding a 
> deadlock with umount and ocfs2 workqueue triggered by ocfs2rec 
> thread. The stack as follows:
> journal recovery work:
> [] call_rwsem_down_read_failed+0x14/0x30
> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> [] process_one_work+0x130/0x350
> [] worker_thread+0x46/0x3b0
> [] kthread+0x101/0x140
> [] ret_from_fork+0x1f/0x30
> [] 0x
>
> /bin/umount:
> [] flush_workqueue+0x104/0x3e0
> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> [] generic_shutdown_super+0x6d/0x120
> [] kill_block_super+0x2d/0x60
> [] deactivate_locked_super+0x51/0x90
> [] cleanup_mnt+0x3b/0x70
> [] task_work_run+0x86/0xa0
> [] exit_to_usermode_loop+0x6d/0xa9
> [] do_syscall_64+0x11d/0x130
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
>   
> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which 
> was already locked by umount thread, then get a deadlock.

 Good catch, thanks for reporting.  Is it reproducible? Can you please 
 share
 the steps for reproducing this issue?
> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 
> and 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
> already removed.
> Shall we add a new mutex?

 @Jan, I don't look into the code yet, could you help me understand why 
 we
 need to get sb->s_umount in ocfs2_finish_quota_recovery?
 Is it because that the quota recovery process will start at umounting? 
 or
 some where else?
>>>
>>> I was refreshing my memory wrt how ocfs2 quota recovery works. The 
>>> problem
>>> is the following: We load information about all quota information that
>>> needs recovering (this is possibly for other nodes) in
>>> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
>>> recovery happens from the recovery thread in 
>>> ocfs2_finish_quota_recovery().
>>> We need to protect code running there from dquot_disable() calls as that
>>> will free structures we use for updating quota information etc. 
>>> Currently
>>> we use sb->s_umount for that protection.
>>>
>>> The problem above apparently happens when someone calls umount before 
>>> the
>>> recovery thread can finish quota recovery. I will think more about how 
>>> to
>>> fix the locking so that this lock inversion does not happen...
>>
>> So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
>> before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
>> if there are not some hidden dependencies on recovery being shut down 
>> only
>> after truncate log / local alloc. If we can do that, we could r

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread Changwei Ge
Hi Jun,

On 2018/1/19 11:59, piaojun wrote:
> Hi Changwei,
> 
> On 2018/1/19 11:38, Changwei Ge wrote:
>> Hi Jun,
>>
>> On 2018/1/19 11:17, piaojun wrote:
>>> Hi Jan, Eric and Changwei,
>>>
>>> Could we use another mutex lock to protect quota recovery? Sharing the
>>> lock with VFS-layer probably seems a little weird.
>>
>> I am afraid that we can't since quota need ::s_umount and we indeed need
>> ::s_umount to get rid of race that quota has freed structs that will be
>> used by quota recovery in ocfs2.
>>
> Could you explain which 'structs' used by quota recovery? Do you mean
> 'struct super_block'?
I am not referring to super_block.

Sure.
You can refer to:
ocfs2_finish_quota_recovery
   ocfs2_recover_local_quota_file -> here, the operations on quota structures happen

Thanks,
Changwei

> 
> thanks,
> Jun
> 
>>>
>>> On 2018/1/19 9:48, Changwei Ge wrote:
 Hi Jan,

 On 2018/1/18 0:03, Jan Kara wrote:
> On Wed 17-01-18 16:21:35, Jan Kara wrote:
>> Hello,
>>
>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
>>> On 01/12/2018 11:43 AM, Shichangkuo wrote:
 Hi all,
   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
 with umount and ocfs2 workqueue triggered by ocfs2rec thread. The 
 stack as follows:
 journal recovery work:
 [] call_rwsem_down_read_failed+0x14/0x30
 [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
 [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
 [] process_one_work+0x130/0x350
 [] worker_thread+0x46/0x3b0
 [] kthread+0x101/0x140
 [] ret_from_fork+0x1f/0x30
 [] 0x

 /bin/umount:
 [] flush_workqueue+0x104/0x3e0
 [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
 [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
 [] ocfs2_put_super+0x31/0xa0 [ocfs2]
 [] generic_shutdown_super+0x6d/0x120
 [] kill_block_super+0x2d/0x60
 [] deactivate_locked_super+0x51/0x90
 [] cleanup_mnt+0x3b/0x70
 [] task_work_run+0x86/0xa0
 [] exit_to_usermode_loop+0x6d/0xa9
 [] do_syscall_64+0x11d/0x130
 [] entry_SYSCALL64_slow_path+0x25/0x25
 [] 0x
   
 Function ocfs2_finish_quota_recovery try to get sb->s_umount, which 
 was already locked by umount thread, then get a deadlock.
>>>
>>> Good catch, thanks for reporting.  Is it reproducible? Can you please 
>>> share
>>> the steps for reproducing this issue?
 This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 
 and 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
 I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
 already removed.
 Shall we add a new mutex?
>>>
>>> @Jan, I don't look into the code yet, could you help me understand why 
>>> we
>>> need to get sb->s_umount in ocfs2_finish_quota_recovery?
>>> Is it because that the quota recovery process will start at umounting? 
>>> or
>>> some where else?
>>
>> I was refreshing my memory wrt how ocfs2 quota recovery works. The 
>> problem
>> is the following: We load information about all quota information that
>> needs recovering (this is possibly for other nodes) in
>> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
>> recovery happens from the recovery thread in 
>> ocfs2_finish_quota_recovery().
>> We need to protect code running there from dquot_disable() calls as that
>> will free structures we use for updating quota information etc. Currently
>> we use sb->s_umount for that protection.
>>
>> The problem above apparently happens when someone calls umount before the
>> recovery thread can finish quota recovery. I will think more about how to
>> fix the locking so that this lock inversion does not happen...
>
> So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
> before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
> if there are not some hidden dependencies on recovery being shut down only
> after truncate log / local alloc. If we can do that, we could remove
> s_umount protection from ocfs2_finish_quota_recovery() and thus resolve 
> the
> race.
>
>   Honza

 Thanks for looking into this.
 I am not quite familiar with quota part.:)

 Or can we move ocfs2_disable_quotas() in ocfs2_dismount_volume down
 after ocfs2_recovery_exit() with ::invoking down_read(&sb->s_umount)
 eliminated?

 Another way I can figure out is:
 I think we might get inspired from qsync_work_fn().
 In that function if current work is under running context of umount with
 ::s_umount held, it just delays current work to next time.

 So can we also _try lock_ in ocfs2

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread piaojun
Hi Changwei,

On 2018/1/19 11:38, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/19 11:17, piaojun wrote:
>> Hi Jan, Eric and Changwei,
>>
>> Could we use another mutex lock to protect quota recovery? Sharing the
>> lock with VFS-layer probably seems a little weird.
> 
> I am afraid that we can't since quota need ::s_umount and we indeed need 
> ::s_umount to get rid of race that quota has freed structs that will be 
> used by quota recovery in ocfs2.
> 
Could you explain which 'structs' are used by quota recovery? Do you mean
'struct super_block'?

thanks,
Jun

>>
>> On 2018/1/19 9:48, Changwei Ge wrote:
>>> Hi Jan,
>>>
>>> On 2018/1/18 0:03, Jan Kara wrote:
 On Wed 17-01-18 16:21:35, Jan Kara wrote:
> Hello,
>
> On Fri 12-01-18 16:25:56, Eric Ren wrote:
>> On 01/12/2018 11:43 AM, Shichangkuo wrote:
>>> Hi all,
>>>   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
>>> with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack 
>>> as follows:
>>> journal recovery work:
>>> [] call_rwsem_down_read_failed+0x14/0x30
>>> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
>>> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
>>> [] process_one_work+0x130/0x350
>>> [] worker_thread+0x46/0x3b0
>>> [] kthread+0x101/0x140
>>> [] ret_from_fork+0x1f/0x30
>>> [] 0x
>>>
>>> /bin/umount:
>>> [] flush_workqueue+0x104/0x3e0
>>> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
>>> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
>>> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
>>> [] generic_shutdown_super+0x6d/0x120
>>> [] kill_block_super+0x2d/0x60
>>> [] deactivate_locked_super+0x51/0x90
>>> [] cleanup_mnt+0x3b/0x70
>>> [] task_work_run+0x86/0xa0
>>> [] exit_to_usermode_loop+0x6d/0xa9
>>> [] do_syscall_64+0x11d/0x130
>>> [] entry_SYSCALL64_slow_path+0x25/0x25
>>> [] 0x
>>>   
>>> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
>>> already locked by umount thread, then get a deadlock.
>>
>> Good catch, thanks for reporting.  Is it reproducible? Can you please 
>> share
>> the steps for reproducing this issue?
>>> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 
>>> and 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
>>> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
>>> already removed.
>>> Shall we add a new mutex?
>>
>> @Jan, I don't look into the code yet, could you help me understand why we
>> need to get sb->s_umount in ocfs2_finish_quota_recovery?
>> Is it because that the quota recovery process will start at umounting? or
>> some where else?
>
> I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
> is the following: We load information about all quota information that
> needs recovering (this is possibly for other nodes) in
> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
> recovery happens from the recovery thread in 
> ocfs2_finish_quota_recovery().
> We need to protect code running there from dquot_disable() calls as that
> will free structures we use for updating quota information etc. Currently
> we use sb->s_umount for that protection.
>
> The problem above apparently happens when someone calls umount before the
> recovery thread can finish quota recovery. I will think more about how to
> fix the locking so that this lock inversion does not happen...

 So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
 before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
 if there are not some hidden dependencies on recovery being shut down only
 after truncate log / local alloc. If we can do that, we could remove
 s_umount protection from ocfs2_finish_quota_recovery() and thus resolve the
 race.

Honza
>>>
>>> Thanks for looking into this.
>>> I am not quite familiar with quota part.:)
>>>
>>> Or can we move ocfs2_disable_quotas() in ocfs2_dismount_volume down
>>> after ocfs2_recovery_exit() with ::invoking down_read(&sb->s_umount)
>>> eliminated?
>>>
>>> Another way I can figure out is:
>>> I think we might get inspired from qsync_work_fn().
>>> In that function if current work is under running context of umount with
>>> ::s_umount held, it just delays current work to next time.
>>>
>>> So can we also _try lock_ in ocfs2_finish_quota_recovery() and recover
>>> quota by other ocfs2 cluster member nodes or local node's next time of
>>> mount?
>>>
>> I guess we need analyse the impact of _try lock_. Such as no other node
>> will help recovering quota when I'm the only node in cluster.
> 
> I don't see any risk for now. I will think about it more, later.
> 
> Thanks,
> Changwei
>>
>> th

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread Changwei Ge
Hi Jun,

On 2018/1/19 11:17, piaojun wrote:
> Hi Jan, Eric and Changwei,
> 
> Could we use another mutex lock to protect quota recovery? Sharing the
> lock with VFS-layer probably seems a little weird.

I am afraid that we can't, since quota needs ::s_umount, and we indeed need
::s_umount to get rid of the race where quota has already freed structs that
will be used by quota recovery in ocfs2.

> 
> On 2018/1/19 9:48, Changwei Ge wrote:
>> Hi Jan,
>>
>> On 2018/1/18 0:03, Jan Kara wrote:
>>> On Wed 17-01-18 16:21:35, Jan Kara wrote:
 Hello,

 On Fri 12-01-18 16:25:56, Eric Ren wrote:
> On 01/12/2018 11:43 AM, Shichangkuo wrote:
>> Hi all,
>>   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
>> with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack 
>> as follows:
>> journal recovery work:
>> [] call_rwsem_down_read_failed+0x14/0x30
>> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
>> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
>> [] process_one_work+0x130/0x350
>> [] worker_thread+0x46/0x3b0
>> [] kthread+0x101/0x140
>> [] ret_from_fork+0x1f/0x30
>> [] 0x
>>
>> /bin/umount:
>> [] flush_workqueue+0x104/0x3e0
>> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
>> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
>> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
>> [] generic_shutdown_super+0x6d/0x120
>> [] kill_block_super+0x2d/0x60
>> [] deactivate_locked_super+0x51/0x90
>> [] cleanup_mnt+0x3b/0x70
>> [] task_work_run+0x86/0xa0
>> [] exit_to_usermode_loop+0x6d/0xa9
>> [] do_syscall_64+0x11d/0x130
>> [] entry_SYSCALL64_slow_path+0x25/0x25
>> [] 0x
>>   
>> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
>> already locked by umount thread, then get a deadlock.
>
> Good catch, thanks for reporting.  Is it reproducible? Can you please 
> share
> the steps for reproducing this issue?
>> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 
>> and 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
>> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
>> already removed.
>> Shall we add a new mutex?
>
> @Jan, I don't look into the code yet, could you help me understand why we
> need to get sb->s_umount in ocfs2_finish_quota_recovery?
> Is it because that the quota recovery process will start at umounting? or
> some where else?

 I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
 is the following: We load information about all quota information that
 needs recovering (this is possibly for other nodes) in
 ocfs2_begin_quota_recovery() that gets called during mount. Real quota
 recovery happens from the recovery thread in ocfs2_finish_quota_recovery().
 We need to protect code running there from dquot_disable() calls as that
 will free structures we use for updating quota information etc. Currently
 we use sb->s_umount for that protection.

 The problem above apparently happens when someone calls umount before the
 recovery thread can finish quota recovery. I will think more about how to
 fix the locking so that this lock inversion does not happen...
>>>
>>> So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
>>> before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
>>> if there are not some hidden dependencies on recovery being shut down only
>>> after truncate log / local alloc. If we can do that, we could remove
>>> s_umount protection from ocfs2_finish_quota_recovery() and thus resolve the
>>> race.
>>>
>>> Honza
>>
>> Thanks for looking into this.
>> I am not quite familiar with quota part.:)
>>
>> Or can we move ocfs2_disable_quotas() in ocfs2_dismount_volume down
>> after ocfs2_recovery_exit() with ::invoking down_read(&sb->s_umount)
>> eliminated?
>>
>> Another way I can figure out is:
>> I think we might get inspired from qsync_work_fn().
>> In that function if current work is under running context of umount with
>> ::s_umount held, it just delays current work to next time.
>>
>> So can we also _try lock_ in ocfs2_finish_quota_recovery() and recover
>> quota by other ocfs2 cluster member nodes or local node's next time of
>> mount?
>>
> I guess we need analyse the impact of _try lock_. Such as no other node
> will help recovering quota when I'm the only node in cluster.

I don't see any risk for now. I will think about it more, later.

Thanks,
Changwei
> 
> thanks,
> Jun
> 
>> Thanks,
>> Changwei
>>
>>
>>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> 

___
Ocfs2-devel mailing lis

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread piaojun
Hi Jan, Eric and Changwei,

Could we use another mutex lock to protect quota recovery? Sharing the
lock with the VFS layer seems a little weird.

On 2018/1/19 9:48, Changwei Ge wrote:
> Hi Jan,
> 
> On 2018/1/18 0:03, Jan Kara wrote:
>> On Wed 17-01-18 16:21:35, Jan Kara wrote:
>>> Hello,
>>>
>>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
 On 01/12/2018 11:43 AM, Shichangkuo wrote:
> Hi all,
>   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
> with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack 
> as follows:
> journal recovery work:
> [] call_rwsem_down_read_failed+0x14/0x30
> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> [] process_one_work+0x130/0x350
> [] worker_thread+0x46/0x3b0
> [] kthread+0x101/0x140
> [] ret_from_fork+0x1f/0x30
> [] 0x
>
> /bin/umount:
> [] flush_workqueue+0x104/0x3e0
> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> [] generic_shutdown_super+0x6d/0x120
> [] kill_block_super+0x2d/0x60
> [] deactivate_locked_super+0x51/0x90
> [] cleanup_mnt+0x3b/0x70
> [] task_work_run+0x86/0xa0
> [] exit_to_usermode_loop+0x6d/0xa9
> [] do_syscall_64+0x11d/0x130
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
>   
> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
> already locked by umount thread, then get a deadlock.

 Good catch, thanks for reporting.  Is it reproducible? Can you please share
 the steps for reproducing this issue?
> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and 
> 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
> already removed.
> Shall we add a new mutex?

 @Jan, I don't look into the code yet, could you help me understand why we
 need to get sb->s_umount in ocfs2_finish_quota_recovery?
 Is it because that the quota recovery process will start at umounting? or
 some where else?
>>>
>>> I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
>>> is the following: We load information about all quota information that
>>> needs recovering (this is possibly for other nodes) in
>>> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
>>> recovery happens from the recovery thread in ocfs2_finish_quota_recovery().
>>> We need to protect code running there from dquot_disable() calls as that
>>> will free structures we use for updating quota information etc. Currently
>>> we use sb->s_umount for that protection.
>>>
>>> The problem above apparently happens when someone calls umount before the
>>> recovery thread can finish quota recovery. I will think more about how to
>>> fix the locking so that this lock inversion does not happen...
>>
>> So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
>> before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
>> if there are not some hidden dependencies on recovery being shut down only
>> after truncate log / local alloc. If we can do that, we could remove
>> s_umount protection from ocfs2_finish_quota_recovery() and thus resolve the
>> race.
>>
>>  Honza
> 
> Thanks for looking into this.
> I am not quite familiar with quota part.:)
> 
> Or can we move ocfs2_disable_quotas() in ocfs2_dismount_volume down 
> after ocfs2_recovery_exit() with ::invoking down_read(&sb->s_umount) 
> eliminated?
> 
> Another way I can figure out is:
> I think we might get inspired from qsync_work_fn().
> In that function if current work is under running context of umount with 
> ::s_umount held, it just delays current work to next time.
> 
> So can we also _try lock_ in ocfs2_finish_quota_recovery() and recover 
> quota by other ocfs2 cluster member nodes or local node's next time of 
> mount?
> 
I guess we need to analyse the impact of _try lock_. For example, no other
node will help recover quota when I'm the only node in the cluster.

thanks,
Jun

> Thanks,
> Changwei
> 
> 
>>
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread Changwei Ge
Hi Jan,

On 2018/1/18 0:03, Jan Kara wrote:
> On Wed 17-01-18 16:21:35, Jan Kara wrote:
>> Hello,
>>
>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
>>> On 01/12/2018 11:43 AM, Shichangkuo wrote:
 Hi all,
   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
 with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack as 
 follows:
 journal recovery work:
 [] call_rwsem_down_read_failed+0x14/0x30
 [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
 [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
 [] process_one_work+0x130/0x350
 [] worker_thread+0x46/0x3b0
 [] kthread+0x101/0x140
 [] ret_from_fork+0x1f/0x30
 [] 0x

 /bin/umount:
 [] flush_workqueue+0x104/0x3e0
 [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
 [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
 [] ocfs2_put_super+0x31/0xa0 [ocfs2]
 [] generic_shutdown_super+0x6d/0x120
 [] kill_block_super+0x2d/0x60
 [] deactivate_locked_super+0x51/0x90
 [] cleanup_mnt+0x3b/0x70
 [] task_work_run+0x86/0xa0
 [] exit_to_usermode_loop+0x6d/0xa9
 [] do_syscall_64+0x11d/0x130
 [] entry_SYSCALL64_slow_path+0x25/0x25
 [] 0x
   
 Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
 already locked by umount thread, then get a deadlock.
>>>
>>> Good catch, thanks for reporting.  Is it reproducible? Can you please share
>>> the steps for reproducing this issue?
 This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and 
 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
 I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
 already removed.
 Shall we add a new mutex?
>>>
>>> @Jan, I don't look into the code yet, could you help me understand why we
>>> need to get sb->s_umount in ocfs2_finish_quota_recovery?
>>> Is it because that the quota recovery process will start at umounting? or
>>> some where else?
>>
>> I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
>> is the following: We load information about all quota information that
>> needs recovering (this is possibly for other nodes) in
>> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
>> recovery happens from the recovery thread in ocfs2_finish_quota_recovery().
>> We need to protect code running there from dquot_disable() calls as that
>> will free structures we use for updating quota information etc. Currently
>> we use sb->s_umount for that protection.
>>
>> The problem above apparently happens when someone calls umount before the
>> recovery thread can finish quota recovery. I will think more about how to
>> fix the locking so that this lock inversion does not happen...
> 
> So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
> before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
> if there are not some hidden dependencies on recovery being shut down only
> after truncate log / local alloc. If we can do that, we could remove
> s_umount protection from ocfs2_finish_quota_recovery() and thus resolve the
> race.
> 
>   Honza

Thanks for looking into this.
I am not quite familiar with the quota part. :)

Or can we move ocfs2_disable_quotas() in ocfs2_dismount_volume() down
after ocfs2_recovery_exit(), with the down_read(&sb->s_umount) in
ocfs2_finish_quota_recovery() eliminated?

Another way I can think of:
we might take inspiration from qsync_work_fn().
In that function, if the current work runs while umount holds ::s_umount,
it just delays the work to its next run.

So could we also _try lock_ in ocfs2_finish_quota_recovery() and have quota
recovered by another ocfs2 cluster member node, or by this node on its next
mount?
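
Roughly, the idea would be (a sketch only):

	/* in ocfs2_finish_quota_recovery(): don't block on s_umount */
	if (!down_read_trylock(&sb->s_umount)) {
		/* umount (or remount) holds it: leave the recovery to
		 * another cluster node, to the next mount of this node,
		 * or requeue this work to retry later */
		return -EAGAIN;
	}
	/* replay the local quota files as today */
	up_read(&sb->s_umount);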

Thanks,
Changwei


> 
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-17 Thread Jan Kara
On Wed 17-01-18 16:21:35, Jan Kara wrote:
> Hello,
> 
> On Fri 12-01-18 16:25:56, Eric Ren wrote:
> > On 01/12/2018 11:43 AM, Shichangkuo wrote:
> > > Hi all,
> > >   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
> > > with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack 
> > > as follows:
> > > journal recovery work:
> > > [] call_rwsem_down_read_failed+0x14/0x30
> > > [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> > > [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> > > [] process_one_work+0x130/0x350
> > > [] worker_thread+0x46/0x3b0
> > > [] kthread+0x101/0x140
> > > [] ret_from_fork+0x1f/0x30
> > > [] 0x
> > > 
> > > /bin/umount:
> > > [] flush_workqueue+0x104/0x3e0
> > > [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> > > [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> > > [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> > > [] generic_shutdown_super+0x6d/0x120
> > > [] kill_block_super+0x2d/0x60
> > > [] deactivate_locked_super+0x51/0x90
> > > [] cleanup_mnt+0x3b/0x70
> > > [] task_work_run+0x86/0xa0
> > > [] exit_to_usermode_loop+0x6d/0xa9
> > > [] do_syscall_64+0x11d/0x130
> > > [] entry_SYSCALL64_slow_path+0x25/0x25
> > > [] 0x
> > >   
> > > Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
> > > already locked by umount thread, then get a deadlock.
> > 
> > Good catch, thanks for reporting.  Is it reproducible? Can you please share
> > the steps for reproducing this issue?
> > > This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and 
> > > 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> > > I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
> > > already removed.
> > > Shall we add a new mutex?
> > 
> > @Jan, I don't look into the code yet, could you help me understand why we
> > need to get sb->s_umount in ocfs2_finish_quota_recovery?
> > Is it because that the quota recovery process will start at umounting? or
> > some where else?
> 
> I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
> is the following: We load information about all quota information that
> needs recovering (this is possibly for other nodes) in
> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
> recovery happens from the recovery thread in ocfs2_finish_quota_recovery().
> We need to protect code running there from dquot_disable() calls as that
> will free structures we use for updating quota information etc. Currently
> we use sb->s_umount for that protection.
> 
> The problem above apparently happens when someone calls umount before the
> recovery thread can finish quota recovery. I will think more about how to
> fix the locking so that this lock inversion does not happen...

So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
if there are not some hidden dependencies on recovery being shut down only
after truncate log / local alloc. If we can do that, we could remove
s_umount protection from ocfs2_finish_quota_recovery() and thus resolve the
race.
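
I.e. roughly (illustrative only, the final ordering needs the above check):

	/* in ocfs2_dismount_volume(), sketched: */
	ocfs2_recovery_exit(osb);	/* stop and flush recovery work first */
	ocfs2_disable_quotas(osb);	/* dquot_disable() can then no longer
					 * race with quota recovery, so the
					 * down_read(&sb->s_umount) in
					 * ocfs2_finish_quota_recovery() could
					 * be dropped */
	/* truncate log / local alloc shutdown keep their current place,
	 * assuming nothing there depends on recovery still being alive */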

Honza
-- 
Jan Kara 
SUSE Labs, CR

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-17 Thread Jan Kara
Hello,

On Fri 12-01-18 16:25:56, Eric Ren wrote:
> On 01/12/2018 11:43 AM, Shichangkuo wrote:
> > Hi all,
> >   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock with 
> > umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack as 
> > follows:
> > journal recovery work:
> > [] call_rwsem_down_read_failed+0x14/0x30
> > [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> > [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> > [] process_one_work+0x130/0x350
> > [] worker_thread+0x46/0x3b0
> > [] kthread+0x101/0x140
> > [] ret_from_fork+0x1f/0x30
> > [] 0x
> > 
> > /bin/umount:
> > [] flush_workqueue+0x104/0x3e0
> > [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> > [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> > [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> > [] generic_shutdown_super+0x6d/0x120
> > [] kill_block_super+0x2d/0x60
> > [] deactivate_locked_super+0x51/0x90
> > [] cleanup_mnt+0x3b/0x70
> > [] task_work_run+0x86/0xa0
> > [] exit_to_usermode_loop+0x6d/0xa9
> > [] do_syscall_64+0x11d/0x130
> > [] entry_SYSCALL64_slow_path+0x25/0x25
> > [] 0x
> >   
> > Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
> > already locked by umount thread, then get a deadlock.
> 
> Good catch, thanks for reporting.  Is it reproducible? Can you please share
> the steps for reproducing this issue?
> > This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and 
> > 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> > I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
> > already removed.
> > Shall we add a new mutex?
> 
> @Jan, I don't look into the code yet, could you help me understand why we
> need to get sb->s_umount in ocfs2_finish_quota_recovery?
> Is it because that the quota recovery process will start at umounting? or
> some where else?

I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
is the following: We load information about all quota information that
needs recovering (this is possibly for other nodes) in
ocfs2_begin_quota_recovery() that gets called during mount. Real quota
recovery happens from the recovery thread in ocfs2_finish_quota_recovery().
We need to protect code running there from dquot_disable() calls as that
will free structures we use for updating quota information etc. Currently
we use sb->s_umount for that protection.
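
For reference, the recovery-side protection is essentially this (condensed,
the real function does much more):

	/* shape of ocfs2_finish_quota_recovery(), condensed */
	down_read(&sb->s_umount);	/* keep dquot_disable() / umount away */
	/* replay the local quota files recorded by
	 * ocfs2_begin_quota_recovery() at mount time */
	up_read(&sb->s_umount);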

The problem above apparently happens when someone calls umount before the
recovery thread can finish quota recovery. I will think more about how to
fix the locking so that this lock inversion does not happen...

Honza
-- 
Jan Kara 
SUSE Labs, CR

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-12 Thread Eric Ren
Hi,

On 01/12/2018 11:43 AM, Shichangkuo wrote:
> Hi all,
>   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock with 
> umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack as follows:
> journal recovery work:
> [] call_rwsem_down_read_failed+0x14/0x30
> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> [] process_one_work+0x130/0x350
> [] worker_thread+0x46/0x3b0
> [] kthread+0x101/0x140
> [] ret_from_fork+0x1f/0x30
> [] 0x
>
> /bin/umount:
> [] flush_workqueue+0x104/0x3e0
> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> [] generic_shutdown_super+0x6d/0x120
> [] kill_block_super+0x2d/0x60
> [] deactivate_locked_super+0x51/0x90
> [] cleanup_mnt+0x3b/0x70
> [] task_work_run+0x86/0xa0
> [] exit_to_usermode_loop+0x6d/0xa9
> [] do_syscall_64+0x11d/0x130
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
>   
> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
> already locked by umount thread, then get a deadlock.

Good catch, thanks for reporting.  Is it reproducible? Can you please 
share the steps for reproducing this issue?
> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and 
> 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was already 
> removed.
> Shall we add a new mutex?

@Jan, I haven't looked into the code yet; could you help me understand why
we need to take sb->s_umount in ocfs2_finish_quota_recovery?
Is it because the quota recovery process will start at umount time,
or somewhere else?

Thanks,
Eric



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-11 Thread Shichangkuo
Hi Joseph,
Thanks for replying.
Umount flushes the ocfs2 workqueue in ocfs2_truncate_log_shutdown(), and the
journal recovery is one of the works queued on the ocfs2 workqueue.
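
So the wait cycle is:

  /bin/umount: holds sb->s_umount across deactivate_locked_super()
    -> flush_workqueue() in ocfs2_truncate_log_shutdown()
         waits for the journal recovery work
  journal recovery work: ocfs2_complete_recovery()
    -> ocfs2_finish_quota_recovery() does down_read(&sb->s_umount)
         waits for umount to release sb->s_umount  => deadlock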

Thanks
Changkuo

> -Original Message-
> From: Joseph Qi [mailto:jiangqi...@gmail.com]
> Sent: Friday, January 12, 2018 1:51 PM
> To: shichangkuo (Cloud); z...@suse.com; j...@suse.cz
> Cc: ocfs2-devel@oss.oracle.com
> Subject: Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2
> workqueue triggered by ocfs2rec thread
> 
> Hi Changkuo,
> 
> You said s_umount was acquired by umount and ocfs2rec was blocked when
> acquiring it. But you didn't describe why umount was blocked.
> 
> Thanks,
> Joseph
> 
> On 18/1/12 11:43, Shichangkuo wrote:
> > Hi all,
> >   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock
> with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack as
> follows:
> > journal recovery work:
> > [] call_rwsem_down_read_failed+0x14/0x30
> > [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> > [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> > [] process_one_work+0x130/0x350
> > [] worker_thread+0x46/0x3b0
> > [] kthread+0x101/0x140
> > [] ret_from_fork+0x1f/0x30
> > [] 0x
> >
> > /bin/umount:
> > [] flush_workqueue+0x104/0x3e0
> > [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> > [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> > [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> > [] generic_shutdown_super+0x6d/0x120
> > [] kill_block_super+0x2d/0x60
> > [] deactivate_locked_super+0x51/0x90
> > [] cleanup_mnt+0x3b/0x70
> > [] task_work_run+0x86/0xa0
> > [] exit_to_usermode_loop+0x6d/0xa9
> > [] do_syscall_64+0x11d/0x130
> > [] entry_SYSCALL64_slow_path+0x25/0x25
> > [] 0x
> >
> > Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was
> > already locked by umount thread, then get a deadlock.
> > This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12
> > and 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> > I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was
> > already removed.
> > Shall we add a new mutex?
> >
> > Thanks
> > Changkuo
> > ___
> > Ocfs2-devel mailing list
> > Ocfs2-devel@oss.oracle.com
> > https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> >
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-11 Thread Joseph Qi
Hi Changkuo,

You said s_umount was acquired by umount and ocfs2rec was blocked when
acquiring it. But you didn't describe why umount was blocked.

Thanks,
Joseph

On 18/1/12 11:43, Shichangkuo wrote:
> Hi all,
>   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock with 
> umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack as follows:
> journal recovery work:
> [] call_rwsem_down_read_failed+0x14/0x30
> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> [] process_one_work+0x130/0x350
> [] worker_thread+0x46/0x3b0
> [] kthread+0x101/0x140
> [] ret_from_fork+0x1f/0x30
> [] 0x
> 
> /bin/umount:
> [] flush_workqueue+0x104/0x3e0
> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> [] generic_shutdown_super+0x6d/0x120
> [] kill_block_super+0x2d/0x60
> [] deactivate_locked_super+0x51/0x90
> [] cleanup_mnt+0x3b/0x70
> [] task_work_run+0x86/0xa0
> [] exit_to_usermode_loop+0x6d/0xa9
> [] do_syscall_64+0x11d/0x130
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
>   
> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
> already locked by umount thread, then get a deadlock.
> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and 
> 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was already 
> removed.
> Shall we add a new mutex?
> 
> Thanks
> Changkuo
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-11 Thread Shichangkuo
Hi all,
  Now we are testing ocfs2 with the 4.14 kernel, and we found a deadlock
between umount and the ocfs2 workqueue, triggered by the ocfs2rec thread.
The stacks are as follows:
journal recovery work:
[] call_rwsem_down_read_failed+0x14/0x30
[] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
[] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
[] process_one_work+0x130/0x350
[] worker_thread+0x46/0x3b0
[] kthread+0x101/0x140
[] ret_from_fork+0x1f/0x30
[] 0x

/bin/umount:
[] flush_workqueue+0x104/0x3e0
[] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
[] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
[] ocfs2_put_super+0x31/0xa0 [ocfs2]
[] generic_shutdown_super+0x6d/0x120
[] kill_block_super+0x2d/0x60
[] deactivate_locked_super+0x51/0x90
[] cleanup_mnt+0x3b/0x70
[] task_work_run+0x86/0xa0
[] exit_to_usermode_loop+0x6d/0xa9
[] do_syscall_64+0x11d/0x130
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x
  
Function ocfs2_finish_quota_recovery() tries to take sb->s_umount, which was
already locked by the umount thread, so we get a deadlock.
This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and
5f530de63cfc6ca8571cbdf58af63fb166cc6517.
I think we cannot use ::s_umount, but the mutex ::dqonoff_mutex was already
removed.
Shall we add a new mutex?
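
The pattern can also be reproduced in isolation with a tiny userspace model
(a pthread rwlock standing in for s_umount and a joined thread standing in
for the flushed work; this is only an analogy, not ocfs2 code):

/*
 * deadlock_model.c - the "umount" thread takes an rwlock exclusively and
 * then waits for the "worker", while the worker needs the same lock for
 * reading.  Build with: cc deadlock_model.c -o deadlock_model -lpthread
 * The program intentionally hangs forever.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t s_umount = PTHREAD_RWLOCK_INITIALIZER;

static void *recovery_work(void *arg)
{
	(void)arg;
	printf("work: trying to take s_umount for read\n");
	pthread_rwlock_rdlock(&s_umount);	/* blocks: umount holds it */
	pthread_rwlock_unlock(&s_umount);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	/* umount: deactivate_locked_super() already holds s_umount ... */
	pthread_rwlock_wrlock(&s_umount);

	pthread_create(&worker, NULL, recovery_work, NULL);

	/* ... and flush_workqueue() then waits for the pending work */
	printf("umount: waiting for the recovery work to finish\n");
	pthread_join(worker, NULL);		/* never returns */

	pthread_rwlock_unlock(&s_umount);
	return 0;
}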

Thanks
Changkuo
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel