Re: [Ocfs2-devel] [PATCH] ocfs2: submit another bio if current bio is full

2018-05-14 Thread piaojun
Hi Changwei,

I see your point: we should let the caller retry if the bio is not big enough,
right? But some callers, like o2hb_issue_node_write(), won't retry once an error
happens, even though the bio will always be big enough there. What if we
calculated the number of bios we need before calling bio_add_page()?
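
Something along these lines is what I have in mind (just a rough sketch to
show the idea, not a tested patch; the helper name and the assumption that
each heartbeat slot page becomes exactly one bio vec are mine):

/* Hypothetical helper: how many bios are needed to cover all slot pages,
 * assuming each bio carries at most O2HB_BIO_VECS (16) page-sized vecs.
 */
static unsigned int o2hb_nr_bios_needed(unsigned int num_slot_pages)
{
	return DIV_ROUND_UP(num_slot_pages, 16 /* O2HB_BIO_VECS */);
}

Then o2hb_read_slots() would know up front how many times it needs to call
o2hb_setup_one_bio(), instead of relying on bio_add_page() returning 0.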

Thanks
Jun

On 2018/5/14 11:21, Changwei Ge wrote:
> Hi Jun,
> 
> Right now, I am afraid that the easiest and fastest way to fix this issue
> is to revert your patch.
> 
> From the comments before bio_add_page(), we can see that it only
> fails if either ::bi_vcnt == ::bi_max_vecs or the bio is a cloned bio.
> 
> 
> So we can judge whether the bio is full by checking whether its return value is zero.
> 
> 
> Thanks,
> 
> Changwei
> 
> 
> On 2018/5/10 9:13, Changwei Ge wrote:
>>
>>
>> On 2018/5/10 8:24, piaojun wrote:
>>>
>>> On 2018/5/9 20:01, Changwei Ge wrote:
>>>> Hi Jun,
>>>>
>>>>
>>>> On 2018/5/9 18:08, piaojun wrote:
>>>>> Hi Changwei,
>>>>>
>>>>> On 2018/4/13 13:51, Changwei Ge wrote:
>>>>>> If cluster scale exceeds 16 nodes, bio will be full and 
>>>>>> bio_add_page()
>>>>>> returns 0 when adding pages to bio. Returning -EIO to 
>>>>>> o2hb_read_slots()
>>>>>> from o2hb_setup_one_bio() will lead to losing chance to allocate more
>>>>>> bios to present all heartbeat region.
>>>>>>
>>>>>> So o2hb_read_slots() fails.
>>>>>>
>>>>>> In my test, making fs fails in starting o2cb service.
>>>>>>
>>>>>> Attach error log:
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 
>>>>>> 4096, vec_start = 0
>>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] 
>>>>>> to bio failed, page ea0002d7ed40, len 0, vec_len 4096, 
>>>>>> vec_start 0, bi_sector 8192
>>>>>> (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
>>>>>> (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
>>>>>> (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
>>>>>>
>>>>>> Fixes: ba16ddfbeb9d ("ocfs2/o2hb: check len for bio_add_page() to 
>>>>>> avoid getting incorrect bio")
>>>>>>
>>>>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>>>>> ---
>>>>>>fs/ocfs2/cluster/heartbeat.c | 8 ++--

Re: [Ocfs2-devel] [PATCH] ocfs2: submit another bio if current bio is full

2018-05-09 Thread piaojun


On 2018/5/9 20:01, Changwei Ge wrote:
> Hi Jun,
> 
> 
> On 2018/5/9 18:08, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/4/13 13:51, Changwei Ge wrote:
>>> If cluster scale exceeds 16 nodes, bio will be full and bio_add_page()
>>> returns 0 when adding pages to bio. Returning -EIO to o2hb_read_slots()
>>> from o2hb_setup_one_bio() will lead to losing chance to allocate more
>>> bios to present all heartbeat region.
>>>
>>> So o2hb_read_slots() fails.
>>>
>>> In my test, making fs fails in starting o2cb service.
>>>
>>> Attach error log:
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio 
>>> failed, page ea0002d7ed40, len 0, vec_len 4096, vec_start 0, bi_sector 
>>> 8192
>>> (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
>>> (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
>>> (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
>>>
>>> Fixes: ba16ddfbeb9d ("ocfs2/o2hb: check len for bio_add_page() to avoid 
>>> getting incorrect bio")
>>>
>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>> ---
>>>   fs/ocfs2/cluster/heartbeat.c | 8 ++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>> index 91a8889abf9b..2809e29d612d 100644
>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>> @@ -540,11 +540,12 @@ static struct bio *o2hb_setup_one_bio(struct 
>>> o2hb_region *reg,
>>> struct bio *bio;
>>> struct page *page;
>>>   
>>> +#define O2HB_BIO_VECS 16
>>> /* Testing has shown this allocation to take long enough under
>>>  * GFP_KERNEL that the local node can get fenced. It would be
>>>  * nicest if we could pre-allocate these bios and avoid this
>>>  * all together. */
>>> -   bio = bio_alloc(GFP_ATOMIC, 16);
>>> +   bio = bio_alloc(GFP_ATOMIC, O2HB_BIO_VECS);
>>> if (!bio) {
>>> mlog(ML_ERROR, "Could not alloc slots BIO!\n");
>>> bio = ERR_PTR(-ENOMEM);
>>> @@ -570,7 +571,10 @@ static struct bio *o2hb_setup_one_bio(struct 
>>> o2hb_region *reg,
>>>  current_page, vec_len, vec_start);
>>>   
>> Should we check the validity of 'current_page' before bio_add_page()? And
>> that will prevent error happen. Others looks OK.
> 
> If I understand correctly, you mean we should check whether current_page is
> NULL or not?
> If so, I think there is no need, since o2hb should guarantee that it has
> already reserved enough pages on behalf of disk heartbeat read/write.

I mean we could check if 'current_page' e

Re: [Ocfs2-devel] [PATCH] ocfs2: submit another bio if current bio is full

2018-05-09 Thread piaojun
Hi Changwei,

On 2018/4/13 13:51, Changwei Ge wrote:
> If cluster scale exceeds 16 nodes, bio will be full and bio_add_page()
> returns 0 when adding pages to bio. Returning -EIO to o2hb_read_slots()
> from o2hb_setup_one_bio() will lead to losing chance to allocate more
> bios to present all heartbeat region.
> 
> So o2hb_read_slots() fails.
> 
> In my test, making fs fails in starting o2cb service.
> 
> Attach error log:
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio 
> failed, page ea0002d7ed40, len 0, vec_len 4096, vec_start 0, bi_sector 
> 8192
> (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
> (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
> (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
> 
> Fixes: ba16ddfbeb9d ("ocfs2/o2hb: check len for bio_add_page() to avoid 
> getting incorrect bio")
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/cluster/heartbeat.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
> index 91a8889abf9b..2809e29d612d 100644
> --- a/fs/ocfs2/cluster/heartbeat.c
> +++ b/fs/ocfs2/cluster/heartbeat.c
> @@ -540,11 +540,12 @@ static struct bio *o2hb_setup_one_bio(struct 
> o2hb_region *reg,
>   struct bio *bio;
>   struct page *page;
>  
> +#define O2HB_BIO_VECS 16
>   /* Testing has shown this allocation to take long enough under
>* GFP_KERNEL that the local node can get fenced. It would be
>* nicest if we could pre-allocate these bios and avoid this
>* all together. */
> - bio = bio_alloc(GFP_ATOMIC, 16);
> + bio = bio_alloc(GFP_ATOMIC, O2HB_BIO_VECS);
>   if (!bio) {
>   mlog(ML_ERROR, "Could not alloc slots BIO!\n");
>   bio = ERR_PTR(-ENOMEM);
> @@ -570,7 +571,10 @@ static struct bio *o2hb_setup_one_bio(struct o2hb_region 
> *reg,
>current_page, vec_len, vec_start);
>  
Should we check the validity of 'current_page' before bio_add_page()? That
would prevent the error from happening. The rest looks OK.
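
For example, something like this (only a sketch of the idea, untested; the
O2HB_BIO_VECS limit is the one introduced by your patch, and 'pages_in_bio'
would be a new local counter, not an existing variable):

	/* Sketch: track how many pages this bio already holds and stop
	 * before bio_add_page() can return 0, so the caller just moves on
	 * and allocates the next bio.
	 */
	if (pages_in_bio == O2HB_BIO_VECS)
		break;

	len = bio_add_page(bio, page, vec_len, vec_start);
	if (len != vec_len)
		goto error;
	pages_in_bio++;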

thanks,
Jun
>   len = bio_add_page(bio, page, vec_len, vec_start);
> - if (len != vec_len) {
> + if (len == 0 && current_page == O2HB_BIO_VECS) {
> + /* bio is full now. */
> + goto bail;
> + } else if (len != vec_len) {
>   mlog(ML_ERROR, "Adding page[%d] to bio failed, "
>"page %p, len %d, vec_len %u, vec_start %u, "
>"bi_sector %llu\n", current_page, page, len,
> 



Re: [Ocfs2-devel] [PATCH] ocfs2: submit another bio if current bio is full

2018-05-09 Thread piaojun
Hi Changwei,

I understand your fix now, but I'm still confused by the commit message
"If cluster scale exceeds 16 nodes, ...".
Do you mean that this problem will happen if the node count exceeds 16?

thanks,
Jun

On 2018/5/9 17:06, Changwei Ge wrote:
> Hi Jun,
> 
> 
> On 2018/5/9 16:50, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/5/8 23:57, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> Sorry for the late reply; I have been very busy these days.
>>>
>>>
>>> On 04/16/2018 11:44 AM, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> Do you mean that if the slot number exceeds 16, like 'mkfs.ocfs2 -N 17', you
>>>> still let it go rather than return an error?
>>> If your assumption is right, do you mean that ocfs2 slots can't exceed 16?
>>>
>>> If you return an error once slots exceed 16, mkfs will never succeed.
>>>
>>> So if we can ensure the bio is full in the current iteration, we should move
>>> on to the next iteration, allocate a new bio, add pages to it and continue.
>>>
>>> And your patch makes my ocfs2-test fail.
>>>
>>>
>>> Thanks,
>>> Changwei
>>>
>>>> thanks,
>>>> Jun
>>>>
>>>> On 2018/4/13 13:51, Changwei Ge wrote:
>>>>> If cluster scale exceeds 16 nodes, bio will be full and bio_add_page()
>> Sorry for misunderstanding your fix. Do you mean that the node number is
>> too big to be covered by 16 pages, such as 129?
>>
>> "one page could cover 8 node's slots"
> It has nothing to do with how many slots a page can hold.
> It's about how many vecs a bio can have.
> 
> For your reference, bio_alloc() is called with a maximum of 16 vecs in
> o2hb_setup_one_bio() as a precondition.
> 
> Thanks,
> Changwei
> 
>>
>> thanks,
>> Jun
>>
>>>>> returns 0 when adding pages to bio. Returning -EIO to o2hb_read_slots()
>>>>> from o2hb_setup_one_bio() will lead to losing chance to allocate more
>>>>> bios to present all heartbeat region.
>>>>>
>>>>> So o2hb_read_slots() fails.
>>>>>
>>>>> In my test, making fs fails in starting o2cb service.
>>>>>
>>>>> Attach error log:
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, 
>>>>> vec_start = 0
>>>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio 
>>>>> failed, page ea0002d7ed40, len 0, vec_len 4096, vec_start 0, 
>>>>> bi_sector 8192
>>>>> (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
>>>>> (mkfs.ocfs2,2

Re: [Ocfs2-devel] [PATCH] ocfs2: submit another bio if current bio is full

2018-05-09 Thread piaojun
Hi Changwei,

On 2018/5/8 23:57, Changwei Ge wrote:
> Hi Jun,
> 
> Sorry for the late reply; I have been very busy these days.
> 
> 
> On 04/16/2018 11:44 AM, piaojun wrote:
>> Hi Changwei,
>>
>> Do you mean that if the slot number exceeds 16, like 'mkfs.ocfs2 -N 17', you
>> still let it go rather than return an error?
> 
> If your assumption is right, do you mean that ocfs2 slots can't exceed 16?
> 
> If you return an error once slots exceed 16, mkfs will never succeed.
> 
> So if we can ensure the bio is full in the current iteration, we should move
> on to the next iteration, allocate a new bio, add pages to it and continue.
> 
> And your patch makes my ocfs2-test fail.
> 
> 
> Thanks,
> Changwei
> 
>>
>> thanks,
>> Jun
>>
>> On 2018/4/13 13:51, Changwei Ge wrote:
>>> If cluster scale exceeds 16 nodes, bio will be full and bio_add_page()

Sorry for misunderstanding your fix. Do you mean that the node number is
too big to be covered by 16 pages, such as 129?

"one page could cover 8 node's slots"

thanks,
Jun

>>> returns 0 when adding pages to bio. Returning -EIO to o2hb_read_slots()
>>> from o2hb_setup_one_bio() will lead to losing chance to allocate more
>>> bios to present all heartbeat region.
>>>
>>> So o2hb_read_slots() fails.
>>>
>>> In my test, making fs fails in starting o2cb service.
>>>
>>> Attach error log:
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, 
>>> vec_start = 0
>>> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio 
>>> failed, page ea0002d7ed40, len 0, vec_len 4096, vec_start 0, bi_sector 
>>> 8192
>>> (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
>>> (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
>>> (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
>>>
>>> Fixes: ba16ddfbeb9d ("ocfs2/o2hb: check len for bio_add_page() to avoid 
>>> getting incorrect bio")
>>>
>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>> ---
>>>   fs/ocfs2/cluster/heartbeat.c | 8 ++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>> index 91a8889abf9b..2809e29d612d 100644
>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>> @@ -540,11 +540,12 @@ static struct bio *o2hb_setup_one_bio(struct 
>>> o2hb_region *reg,
>>> struct bio *bio;
>>> struct page *page;
>>>   
>>> +#define O2HB_BIO_VECS 16
>>> /* Testing has shown this allocation to take long enough under
>>>  * GFP_KERNEL that the local node can get fenced. It would be
>>>  * nicest if we could pre-allocate these bios and avoid this
>>>  * all together. */
>>> -

Re: [Ocfs2-devel] [PATCH] fix a compiling warning

2018-05-06 Thread piaojun
Hi Larry,

'had_lock' will be initialized by ocfs2_inode_lock_tracker(), and the
'bail' branch above won't use it either as 'inode_locked' is still zero.
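
To make that concrete, the flow I am referring to is roughly the following
(paraphrased from ocfs2_setattr(), not the exact code):

	int had_lock;	/* not assigned yet */
	...
	if (early_check_fails)
		goto bail;	/* inode_locked is still 0, had_lock unused */

	had_lock = ocfs2_inode_lock_tracker(inode, &bh, 1, &oh);
	if (had_lock < 0) {
		status = had_lock;
		goto bail;
	}
	inode_locked = 1;
	...
bail:
	if (inode_locked)
		ocfs2_inode_unlock_tracker(inode, 1, &oh, had_lock);

So every path that actually reads 'had_lock' only runs after it has been
assigned by ocfs2_inode_lock_tracker().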

thanks,
Jun

On 2018/5/6 17:49, Larry Chen wrote:
> The variable had_lock might be used uninitialized.
> 
> Signed-off-by: Larry Chen 
> ---
>  fs/ocfs2/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6ee94bc23f5b..50f17f56db36 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1133,7 +1133,7 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr 
> *attr)
>   handle_t *handle = NULL;
>   struct dquot *transfer_to[MAXQUOTAS] = { };
>   int qtype;
> - int had_lock;
> + int had_lock = 0;
>   struct ocfs2_lock_holder oh;
>  
>   trace_ocfs2_setattr(inode, dentry,
> 



Re: [Ocfs2-devel] [PATCH v3] ocfs2: clean up redundant function declarations

2018-04-25 Thread piaojun
LGTM

On 2018/4/25 19:21, Jia Guo wrote:
> The function ocfs2_extend_allocation has been deleted, clean up its
> declaration. Also change the static function name from
> __ocfs2_extend_allocation() to ocfs2_extend_allocation() to be
> consistent with the corresponding trace events as well as comments
> for ocfs2_lock_allocators().
> 
> Fixes: 964f14a0d350 ("ocfs2: clean up some dead code")
> 
> Signed-off-by: Jia Guo 
> Acked-by: Joseph Qi 
Reviewed-by: Jun Piao 
> ---
>  fs/ocfs2/file.c | 10 +-
>  fs/ocfs2/file.h |  2 --
>  2 files changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6ee94bc..a2a8603 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -563,8 +563,8 @@ int ocfs2_add_inode_data(struct ocfs2_super *osb,
>   return ret;
>  }
> 
> -static int __ocfs2_extend_allocation(struct inode *inode, u32 logical_start,
> -  u32 clusters_to_add, int mark_unwritten)
> +static int ocfs2_extend_allocation(struct inode *inode, u32 logical_start,
> +u32 clusters_to_add, int mark_unwritten)
>  {
>   int status = 0;
>   int restart_func = 0;
> @@ -1035,8 +1035,8 @@ int ocfs2_extend_no_holes(struct inode *inode, struct 
> buffer_head *di_bh,
>   clusters_to_add -= oi->ip_clusters;
> 
>   if (clusters_to_add) {
> - ret = __ocfs2_extend_allocation(inode, oi->ip_clusters,
> - clusters_to_add, 0);
> + ret = ocfs2_extend_allocation(inode, oi->ip_clusters,
> +   clusters_to_add, 0);
>   if (ret) {
>   mlog_errno(ret);
>   goto out;
> @@ -1493,7 +1493,7 @@ static int ocfs2_allocate_unwritten_extents(struct 
> inode *inode,
>   goto next;
>   }
> 
> - ret = __ocfs2_extend_allocation(inode, cpos, alloc_size, 1);
> + ret = ocfs2_extend_allocation(inode, cpos, alloc_size, 1);
>   if (ret) {
>   if (ret != -ENOSPC)
>   mlog_errno(ret);
> diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
> index 1fdc983..7eb7f03 100644
> --- a/fs/ocfs2/file.h
> +++ b/fs/ocfs2/file.h
> @@ -65,8 +65,6 @@ int ocfs2_extend_no_holes(struct inode *inode, struct 
> buffer_head *di_bh,
> u64 new_i_size, u64 zero_to);
>  int ocfs2_zero_extend(struct inode *inode, struct buffer_head *di_bh,
> loff_t zero_to);
> -int ocfs2_extend_allocation(struct inode *inode, u32 logical_start,
> - u32 clusters_to_add, int mark_unwritten);
>  int ocfs2_setattr(struct dentry *dentry, struct iattr *attr);
>  int ocfs2_getattr(const struct path *path, struct kstat *stat,
> u32 request_mask, unsigned int flags);
> 



Re: [Ocfs2-devel] [PATCH] ocfs2: submit another bio if current bio is full

2018-04-15 Thread piaojun
Hi Changwei,

Do you mean that if the slot number exceeds 16, like 'mkfs.ocfs2 -N 17', you
still let it go rather than return an error?

thanks,
Jun

On 2018/4/13 13:51, Changwei Ge wrote:
> If cluster scale exceeds 16 nodes, bio will be full and bio_add_page()
> returns 0 when adding pages to bio. Returning -EIO to o2hb_read_slots()
> from o2hb_setup_one_bio() will lead to losing chance to allocate more
> bios to present all heartbeat region.
> 
> So o2hb_read_slots() fails.
> 
> In my test, making fs fails in starting o2cb service.
> 
> Attach error log:
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start 
> = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, 
> vec_start = 0
> (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio 
> failed, page ea0002d7ed40, len 0, vec_len 4096, vec_start 0, bi_sector 
> 8192
> (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
> (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
> (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
> 
> Fixes: ba16ddfbeb9d ("ocfs2/o2hb: check len for bio_add_page() to avoid 
> getting incorrect bio")
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/cluster/heartbeat.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
> index 91a8889abf9b..2809e29d612d 100644
> --- a/fs/ocfs2/cluster/heartbeat.c
> +++ b/fs/ocfs2/cluster/heartbeat.c
> @@ -540,11 +540,12 @@ static struct bio *o2hb_setup_one_bio(struct 
> o2hb_region *reg,
>   struct bio *bio;
>   struct page *page;
>  
> +#define O2HB_BIO_VECS 16
>   /* Testing has shown this allocation to take long enough under
>* GFP_KERNEL that the local node can get fenced. It would be
>* nicest if we could pre-allocate these bios and avoid this
>* all together. */
> - bio = bio_alloc(GFP_ATOMIC, 16);
> + bio = bio_alloc(GFP_ATOMIC, O2HB_BIO_VECS);
>   if (!bio) {
>   mlog(ML_ERROR, "Could not alloc slots BIO!\n");
>   bio = ERR_PTR(-ENOMEM);
> @@ -570,7 +571,10 @@ static struct bio *o2hb_setup_one_bio(struct o2hb_region 
> *reg,
>current_page, vec_len, vec_start);
>  
>   len = bio_add_page(bio, page, vec_len, vec_start);
> - if (len != vec_len) {
> + if (len == 0 && current_page == O2HB_BIO_VECS) {
> + /* bio is full now. */
> + goto bail;
> + } else if (len != vec_len) {
>   mlog(ML_ERROR, "Adding page[%d] to bio failed, "
>"page %p, len %d, vec_len %u, vec_start %u, "
>"bi_sector %llu\n", current_page, page, len,
> 



Re: [Ocfs2-devel] [PATCH V2] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

2018-04-11 Thread piaojun


On 2018/4/12 3:31, Ashish Samant wrote:
> While reflinking an inode, we create a new inode in orphan directory, then
> take EX lock on it, reflink the original inode to orphan inode and release
> EX lock. Once the lock is released another node could request it in EX mode
> from ocfs2_recover_orphans() which causes downconvert of the lock, on this
> node, to NL mode.
> 
> Later we attempt to initialize security acl for the orphan inode and move
> it to the reflink destination. However, while doing this we dont take EX
> lock on the inode. This could potentially cause problems because we could
> be starting transaction, accessing journal and modifying metadata of the
> inode while holding NL lock and with another node holding EX lock on the
> inode.
> 
> Fix this by taking orphan inode cluster lock in EX mode before
> initializing security and moving orphan inode to reflink destination.
> Use the __tracker variant while taking inode lock to avoid recursive
> locking in the ocfs2_init_security_and_acl() call chain.
> 
> Signed-off-by: Ashish Samant 
Acked-by: Jun Piao 
> 
> V1->V2:
> Modify commit message to better reflect the problem in upstream kernel.
> ---
>  fs/ocfs2/refcounttree.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> index ab156e3..1b1283f 100644
> --- a/fs/ocfs2/refcounttree.c
> +++ b/fs/ocfs2/refcounttree.c
> @@ -4250,10 +4250,11 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
>  static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
>struct dentry *new_dentry, bool preserve)
>  {
> - int error;
> + int error, had_lock;
>   struct inode *inode = d_inode(old_dentry);
>   struct buffer_head *old_bh = NULL;
>   struct inode *new_orphan_inode = NULL;
> + struct ocfs2_lock_holder oh;
>  
>   if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)))
>   return -EOPNOTSUPP;
> @@ -4295,6 +4296,14 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
> struct inode *dir,
>   goto out;
>   }
>  
> + had_lock = ocfs2_inode_lock_tracker(new_orphan_inode, NULL, 1,
> + );
> + if (had_lock < 0) {
> + error = had_lock;
> + mlog_errno(error);
> + goto out;
> + }
> +
>   /* If the security isn't preserved, we need to re-initialize them. */
>   if (!preserve) {
>   error = ocfs2_init_security_and_acl(dir, new_orphan_inode,
> @@ -4302,14 +4311,15 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
> struct inode *dir,
>   if (error)
>   mlog_errno(error);
>   }
> -out:
>   if (!error) {
>   error = ocfs2_mv_orphaned_inode_to_new(dir, new_orphan_inode,
>  new_dentry);
>   if (error)
>   mlog_errno(error);
>   }
> + ocfs2_inode_unlock_tracker(new_orphan_inode, 1, , had_lock);
>  
> +out:
>   if (new_orphan_inode) {
>   /*
>* We need to open_unlock the inode no matter whether we
> 



Re: [Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread piaojun
Hi Changwei,

It seems other code paths that try to access 'iocb' would also cause errors,
right? I think we should first find out why 'iocb' is freed.

thanks,
Jun

On 2018/4/11 9:07, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/4/11 8:52, piaojun wrote:
>> Hi Changwei,
>>
>> It looks like a code bug, and 'iocb' should not be freed at this point.
>> Could this BUG be reproduced easily?
> 
> Actually, it's not easy to reproduce since IO is much slower than the CPU
> executing instructions. But the logic here is broken, so we'd better fix it.
> 
> Thanks,
> Changwei
> 
>>
>> thanks,
>> Jun
>>
>> On 2018/4/10 20:00, Changwei Ge wrote:
>>> When -EIOCBQUEUED returns, it means that aio_complete() will be called
>>> from dio_complete(), which is an asynchronous process with respect to write_iter.
>>> Generally, IO is a much slower process than executing instructions, but we
>>> still can't take the risk of accessing a freed iocb.
>>>
>>> And we do face a BUG crash issue.
>>> From crash tool, iocb is obviously freed already.
>>> crash> struct -x kiocb 881a350f5900
>>> struct kiocb {
>>>ki_filp = 0x881a350f5a80,
>>>ki_pos = 0x0,
>>>ki_complete = 0x0,
>>>private = 0x0,
>>>ki_flags = 0x0
>>> }
>>>
>>> And the backtrace shows:
>>> ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
>>> ? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
>>> aio_run_iocb+0x229/0x2f0
>>> ? try_to_wake_up+0x380/0x380
>>> do_io_submit+0x291/0x540
>>> ? syscall_trace_leave+0xad/0x130
>>> SyS_io_submit+0x10/0x20
>>> system_call_fastpath+0x16/0x75
>>>
>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>> ---
>>>   fs/ocfs2/file.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>> index 5d1784a..1393ff2 100644
>>> --- a/fs/ocfs2/file.c
>>> +++ b/fs/ocfs2/file.c
>>> @@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb 
>>> *iocb,
>>>   
>>> written = __generic_file_write_iter(iocb, from);
>>> /* buffered aio wouldn't have proper lock coverage today */
>>> -   BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>>> +   BUG_ON(written == -EIOCBQUEUED && !direct_io);
>>>   
>>> /*
>>>  * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
>>> @@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb 
>>> *iocb,
>>> trace_generic_file_aio_read_ret(ret);
>>>   
>>> /* buffered aio wouldn't have proper lock coverage today */
>>> -   BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>>> +   BUG_ON(ret == -EIOCBQUEUED && !direct_io);
>>>   
>>> /* see ocfs2_file_write_iter */
>>> if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
>>>
>>
> .
> 



Re: [Ocfs2-devel] [PATCH] ocfs2: don't put and assign null to bh allocated outside

2018-04-10 Thread piaojun
Hi Changwei,

On 2018/4/11 9:14, Changwei Ge wrote:
> Hi Jun,
> 
> Thanks for your review.
> 
> On 2018/4/11 8:40, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/4/10 19:35, Changwei Ge wrote:
>>> ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
>>> several blocks from disk. Currently, the input argument *bhs* can be
>>> NULL or NOT. It depends on the caller's behavior. If the function fails
>>> in reading blocks from disk, the corresponding bh will be assigned to
>>> NULL and put.
>>>
>>> Obviously, above process for non-NULL input bh is not appropriate.
>>> Because the caller doesn't even know its bhs are put and re-assigned.
>>>
>>> If buffer head is managed by caller, ocfs2_read_blocks and
>>> ocfs2_read_blocks_sync()  should not evaluate it to NULL. It will
>>> cause caller accessing illegal memory, thus crash.
>>>
>>> Also, we should put bhs which have succeeded in reading before current
>>> read failure.
>>>
>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>> ---
>>>   fs/ocfs2/buffer_head_io.c | 77 
>>> ---
>>>   1 file changed, 59 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
>>> index d9ebe11..7ae4147 100644
>>> --- a/fs/ocfs2/buffer_head_io.c
>>> +++ b/fs/ocfs2/buffer_head_io.c
>>> @@ -99,25 +99,34 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
>>> buffer_head *bh,
>>> return ret;
>>>   }
>>>   
>>> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
>>> + * will be easier to handle read failure.
>>> + */
>>>   int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 block,
>>>unsigned int nr, struct buffer_head *bhs[])
>>>   {
>>> int status = 0;
>>> unsigned int i;
>>> struct buffer_head *bh;
>>> +   int new_bh = 0;
>>>   
>>> trace_ocfs2_read_blocks_sync((unsigned long long)block, nr);
>>>   
>>> if (!nr)
>>> goto bail;
>>>   
>>> +   /* Don't put buffer head and re-assign it to NULL if it is allocated
>>> +* outside since the call can't be aware of this alternation!
>>> +*/
>>> +   new_bh = (bhs[0] == NULL);
>>
>> 'new_bh' only tells us the first bh is NULL; what if a bh in the middle is NULL?
> 
> I am afraid we have to assume that if the first bh is NULL, the others must be
> NULL as well. Otherwise this function's exception handling will be rather
> complicated and messy. So I added a comment before this function to remind the
> caller to pass appropriate arguments in.
> And I checked the callers of this function; they behave as my assumption expects.
> 
> Moreover, who would pass in such a strange bh array?

What about restricting the outside caller's behaviour rather than handling
it inside ocfs2_read_blocks_sync()? Wouldn't that be easier?
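
For instance, a single up-front check at the top of ocfs2_read_blocks_sync()
could enforce that restriction (just a sketch of the idea, not a concrete
patch):

	/* Sketch: reject mixed bhs[] arrays right away instead of
	 * special-casing them later in the read and error paths.
	 */
	for (i = 1; i < nr; i++) {
		if ((bhs[0] == NULL) != (bhs[i] == NULL)) {
			WARN(1, "bhs[] must be all NULL or all non-NULL\n");
			return -EINVAL;
		}
	}

Then the rest of the function would not need any 'new_bh' special cases.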

thanks,
Jun

> 
>>
>>> +
>>> for (i = 0 ; i < nr ; i++) {
>>> if (bhs[i] == NULL) {
>>> bhs[i] = sb_getblk(osb->sb, block++);
>>> if (bhs[i] == NULL) {
>>> status = -ENOMEM;
>>> mlog_errno(status);
>>> -   goto bail;
>>> +   break;
>>> }
>>> }
>>> bh = bhs[i];
>>> @@ -158,9 +167,26 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, 
>>> u64 block,
>>> submit_bh(REQ_OP_READ, 0, bh);
>>> }
>>>   
>>> +read_failure:
>>
>> This looks weird that 'read_failure' include the normal branch.
> 
> Sounds reasonable.
> How about if_read_failure or may_read_failure?
> 
> Thanks,
> Changwei
> 
>>
>>> for (i = nr; i > 0; i--) {
>>> bh = bhs[i - 1];
>>>   
>>> +   if (unlikely(status)) {
>>> +   if (new_bh && !bh) {
>>> +   /* If middle bh fails, let previous bh
>>> +* finish its read and then put it to
>>> +   * avoid bh leak
>>> +*/
>>> +   if (!buffer_jbd(bh))
>>> + 

Re: [Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread piaojun
Hi Changwei,

It looks like a code bug, and 'iocb' should not be freed at this point.
Could this BUG be reproduced easily?

thanks,
Jun

On 2018/4/10 20:00, Changwei Ge wrote:
> When -EIOCBQUEUED returns, it means that aio_complete() will be called
> from dio_complete(), which is an asynchronous process with respect to write_iter.
> Generally, IO is a much slower process than executing instructions, but we
> still can't take the risk of accessing a freed iocb.
> 
> And we do face a BUG crash issue.
> From crash tool, iocb is obviously freed already.
> crash> struct -x kiocb 881a350f5900
> struct kiocb {
>   ki_filp = 0x881a350f5a80,
>   ki_pos = 0x0,
>   ki_complete = 0x0,
>   private = 0x0,
>   ki_flags = 0x0
> }
> 
> And the backtrace shows:
> ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
> ? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
> aio_run_iocb+0x229/0x2f0
> ? try_to_wake_up+0x380/0x380
> do_io_submit+0x291/0x540
> ? syscall_trace_leave+0xad/0x130
> SyS_io_submit+0x10/0x20
> system_call_fastpath+0x16/0x75
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/file.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 5d1784a..1393ff2 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
>  
>   written = __generic_file_write_iter(iocb, from);
>   /* buffered aio wouldn't have proper lock coverage today */
> - BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
> + BUG_ON(written == -EIOCBQUEUED && !direct_io);
>  
>   /*
>* deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
> @@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
>   trace_generic_file_aio_read_ret(ret);
>  
>   /* buffered aio wouldn't have proper lock coverage today */
> - BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
> + BUG_ON(ret == -EIOCBQUEUED && !direct_io);
>  
>   /* see ocfs2_file_write_iter */
>   if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
> 



Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: clean up unused stack variable in dlm_do_local_ast()

2018-04-10 Thread piaojun
Hi Changwei,

This patch has already been merged into mainline.

Please see patch a43d24cb3b0b.

thanks,
Jun

On 2018/4/10 19:49, Changwei Ge wrote:
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/dlm/dlmast.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
> index fd6..39831fc 100644
> --- a/fs/ocfs2/dlm/dlmast.c
> +++ b/fs/ocfs2/dlm/dlmast.c
> @@ -224,14 +224,12 @@ void dlm_do_local_ast(struct dlm_ctxt *dlm, struct 
> dlm_lock_resource *res,
> struct dlm_lock *lock)
>  {
>   dlm_astlockfunc_t *fn;
> - struct dlm_lockstatus *lksb;
>  
>   mlog(0, "%s: res %.*s, lock %u:%llu, Local AST\n", dlm->name,
>res->lockname.len, res->lockname.name,
>dlm_get_lock_cookie_node(be64_to_cpu(lock->ml.cookie)),
>dlm_get_lock_cookie_seq(be64_to_cpu(lock->ml.cookie)));
>  
> - lksb = lock->lksb;
>   fn = lock->ast;
>   BUG_ON(lock->ml.node != dlm->node_num);
>  
> 



Re: [Ocfs2-devel] [PATCH] ocfs2: don't put and assign null to bh allocated outside

2018-04-10 Thread piaojun
Hi Changwei,

On 2018/4/10 19:35, Changwei Ge wrote:
> ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
> several blocks from disk. Currently, the input argument *bhs* can be
> NULL or NOT. It depends on the caller's behavior. If the function fails
> in reading blocks from disk, the corresponding bh will be assigned to
> NULL and put.
> 
> Obviously, above process for non-NULL input bh is not appropriate.
> Because the caller doesn't even know its bhs are put and re-assigned.
> 
> If buffer head is managed by caller, ocfs2_read_blocks and
> ocfs2_read_blocks_sync()  should not evaluate it to NULL. It will
> cause caller accessing illegal memory, thus crash.
> 
> Also, we should put bhs which have succeeded in reading before current
> read failure.
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/buffer_head_io.c | 77 
> ---
>  1 file changed, 59 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
> index d9ebe11..7ae4147 100644
> --- a/fs/ocfs2/buffer_head_io.c
> +++ b/fs/ocfs2/buffer_head_io.c
> @@ -99,25 +99,34 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
> buffer_head *bh,
>   return ret;
>  }
>  
> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
> + * will be easier to handle read failure.
> + */
>  int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 block,
>  unsigned int nr, struct buffer_head *bhs[])
>  {
>   int status = 0;
>   unsigned int i;
>   struct buffer_head *bh;
> + int new_bh = 0;
>  
>   trace_ocfs2_read_blocks_sync((unsigned long long)block, nr);
>  
>   if (!nr)
>   goto bail;
>  
> + /* Don't put buffer head and re-assign it to NULL if it is allocated
> +  * outside since the call can't be aware of this alternation!
> +  */
> + new_bh = (bhs[0] == NULL);

'new_bh' only tells us the first bh is NULL; what if a bh in the middle is NULL?

> +
>   for (i = 0 ; i < nr ; i++) {
>   if (bhs[i] == NULL) {
>   bhs[i] = sb_getblk(osb->sb, block++);
>   if (bhs[i] == NULL) {
>   status = -ENOMEM;
>   mlog_errno(status);
> - goto bail;
> + break;
>   }
>   }
>   bh = bhs[i];
> @@ -158,9 +167,26 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
> block,
>   submit_bh(REQ_OP_READ, 0, bh);
>   }
>  
> +read_failure:

It looks weird that 'read_failure' includes the normal branch.

>   for (i = nr; i > 0; i--) {
>   bh = bhs[i - 1];
>  
> + if (unlikely(status)) {
> + if (new_bh && !bh) {
> + /* If middle bh fails, let previous bh
> +  * finish its read and then put it to
> +  * avoid bh leak
> +  */
> + if (!buffer_jbd(bh))
> + wait_on_buffer(bh);
> + put_bh(bh);
> + bhs[i - 1] = NULL;
> + } else if (buffer_uptodate(bh)) {
> + clear_buffer_uptodate(bh);
> + }
> + continue;
> + }
> +
>   /* No need to wait on the buffer if it's managed by JBD. */
>   if (!buffer_jbd(bh))
>   wait_on_buffer(bh);
> @@ -170,8 +196,7 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
> block,
>* so we can safely record this and loop back
>* to cleanup the other buffers. */
>   status = -EIO;
> - put_bh(bh);
> - bhs[i - 1] = NULL;
> + goto read_failure;
>   }
>   }
>  
> @@ -179,6 +204,9 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
> block,
>   return status;
>  }
>  
> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
> + * will be easier to handle read failure.
> + */
>  int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 block, int nr,
> struct buffer_head *bhs[], int flags,
> int (*validate)(struct super_block *sb,
> @@ -188,6 +216,7 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   int i, ignore_cache = 0;
>   struct buffer_head *bh;
>   struct super_block *sb = ocfs2_metadata_cache_get_super(ci);
> + int new_bh = 0;
>  
>   trace_ocfs2_read_blocks_begin(ci, (unsigned long long)block, nr, flags);
>  
> @@ -213,6 +242,11 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   goto bail;
>   

Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: clean up unused variable in dlm_process_recovery_data

2018-04-02 Thread piaojun
LGTM

On 2018/4/3 13:42, Changwei Ge wrote:
> Signed-off-by: Changwei Ge 
Reviewed-by: Jun Piao 
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index ec8f758..be6b067 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -1807,7 +1807,6 @@ static int dlm_process_recovery_data(struct dlm_ctxt 
> *dlm,
>   int i, j, bad;
>   struct dlm_lock *lock;
>   u8 from = O2NM_MAX_NODES;
> - unsigned int added = 0;
>   __be64 c;
>  
>   mlog(0, "running %d locks for this lockres\n", mres->num_locks);
> @@ -1823,7 +1822,6 @@ static int dlm_process_recovery_data(struct dlm_ctxt 
> *dlm,
>   spin_lock(>spinlock);
>   dlm_lockres_set_refmap_bit(dlm, res, from);
>   spin_unlock(>spinlock);
> - added++;
>   break;
>   }
>   BUG_ON(ml->highest_blocked != LKM_IVMODE);
> @@ -1911,7 +1909,6 @@ static int dlm_process_recovery_data(struct dlm_ctxt 
> *dlm,
>   /* do not alter lock refcount.  switching lists. */
>   list_move_tail(>list, queue);
>   spin_unlock(>spinlock);
> - added++;
>  
>   mlog(0, "just reordered a local lock!\n");
>   continue;
> @@ -2037,7 +2034,6 @@ static int dlm_process_recovery_data(struct dlm_ctxt 
> *dlm,
>"setting refmap bit\n", dlm->name,
>res->lockname.len, res->lockname.name, ml->node);
>   dlm_lockres_set_refmap_bit(dlm, res, ml->node);
> - added++;
>   }
>   spin_unlock(>spinlock);
>   }
> 



Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lock resources

2018-04-01 Thread piaojun
Hi Changwei,

On 2018/4/2 10:59, Changwei Ge wrote:
> Hi Jun,
> 
> It seems that you have ever posted this patch, If I remember correctly.
> My concern is still that if you disable dlm recovery via ::migrate_done, then the
> DLM_LOCK_RES_RECOVERING flag can't be cleared.
After migrate_done is set, all lock resources have already been migrated, so
there won't be any lockres left with DLM_LOCK_RES_RECOVERING. And I have been
testing this patch for a few months.
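
To illustrate what I mean, the ordering in dlm_migrate_all_locks() is roughly
the following (a simplified sketch of the patch, not the literal code):

	/* Don't treat migration as finished while a recovery round may
	 * still be collecting new lock resources for this node.
	 */
	if (!num) {
		if (dlm->reco.state & DLM_RECO_STATE_ACTIVE) {
			mlog(0, "%s: recovery still active, more lock "
			     "resources may need to be migrated\n",
			     dlm->name);
			/* go around and scan the hash again */
		} else {
			dlm->migrate_done = 1;	/* nothing left to migrate */
		}
	}

The intent is that once migrate_done is set, this node no longer takes part in
collecting lock resources during recovery, so nothing is left in
DLM_LOCK_RES_RECOVERING when we leave the domain.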

thanks,
Jun
> 
> So we can't purge the problematic lock resource since __dlm_lockres_unused() 
> needs to check that flag.
> 
> Finally, umount will run the while loop in dlm_migrate_all_locks() infinitely.
> Or am I missing something?
> 
> Thanks,
> Changwei
> 
> On 2018/3/15 21:00, piaojun wrote:
>> Wait for dlm recovery to finish when migrating all lock resources, in case a
>> new lock resource is left behind after leaving the dlm domain. The leftover
>> lock resource will cause other nodes to BUG.
>>
>>   NodeA          NodeB           NodeC
>>
>> umount:
>>dlm_unregister_domain()
>>  dlm_migrate_all_locks()
>>
>>   NodeB down
>>
>> do recovery for NodeB
>> and collect a new lockres
>> form other live nodes:
>>
>>dlm_do_recovery
>>  dlm_remaster_locks
>>dlm_request_all_locks:
>>
>>dlm_mig_lockres_handler
>>  dlm_new_lockres
>>__dlm_insert_lockres
>>
>> at last NodeA become the
>> master of the new lockres
>> and leave domain:
>>dlm_leave_domain()
>>
>>mount:
>>  dlm_join_domain()
>>
>>touch file and request
>>for the owner of the new
>>lockres, but all the
>>other nodes said 'NO',
>>so NodeC decide to be
>>the owner, and send do
>>assert msg to other
>>nodes:
>>dlmlock()
>>  dlm_get_lock_resource()
>>dlm_do_assert_master()
>>
>>    other nodes receive the msg
>>    and found two masters exist,
>>    at last cause BUG in
>>    dlm_assert_master_handler()
>>    -->BUG();
>>
>> Fixes: bc9838c4d44a ("dlm: allow dlm do recovery during shutdown")
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>> ---
>>   fs/ocfs2/dlm/dlmcommon.h   |  1 +
>>   fs/ocfs2/dlm/dlmdomain.c   | 15 +++
>>   fs/ocfs2/dlm/dlmrecovery.c | 13 ++---
>>   3 files changed, 26 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
>> index 953c200..d06e27e 100644
>> --- a/fs/ocfs2/dlm/dlmcommon.h
>> +++ b/fs/ocfs2/dlm/dlmcommon.h
>> @@ -140,6 +140,7 @@ struct dlm_ctxt
>>  u8 node_num;
>>  u32 key;
>>  u8  joining_node;
>> +u8 migrate_done; /* set to 1 means node has migrated all lock resources 
>> */
>>  wait_queue_head_t dlm_join_events;
>>  unsigned long live_nodes_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
>>  unsigned long domain_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>> index 25b76f0..425081b 100644
>> --- a/fs/ocfs2/dlm/dlmdomain.c
>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>> @@ -461,6 +461,19 @@ static int dlm_migrate_all_locks(struct dlm_ctxt *dlm)
>>  cond_resched_lock(>spinlock);
>>  num += n;
>>  }
>> +
>> +if (!num) {
>> +if (dlm->reco.state & DLM_RECO_STATE_ACTIVE) {
>> +mlog(0, "%s: perhaps there are more lock resources "
>> + "need to be mi

Re: [Ocfs2-devel] [PATCH] ocfs2: don't evaluate buffer head to NULL managed by caller

2018-03-29 Thread piaojun
Hi Joseph and Changwei,

On 2018/3/30 9:26, Joseph Qi wrote:
> 
> 
> On 18/3/29 10:06, Changwei Ge wrote:
>> ocfs2_read_blocks() is used to read several blocks from disk.
>> Currently, the input argument *bhs* can be NULL or NOT. It depends on
>> the caller's behavior. If the function fails in reading blocks from
>> disk, the corresponding bh will be assigned to NULL and put.
>>
>> Obviously, above process for non-NULL input bh is not appropriate.
>> Because the caller doesn't even know its bhs are put and re-assigned.
>>
>> If buffer head is managed by caller, ocfs2_read_blocks should not
>> evaluate it to NULL. It will cause caller accessing illegal memory,
>> thus crash.
>>
>> Signed-off-by: Changwei Ge 
>> ---
>>  fs/ocfs2/buffer_head_io.c | 31 +--
>>  1 file changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
>> index d9ebe11..17329b6 100644
>> --- a/fs/ocfs2/buffer_head_io.c
>> +++ b/fs/ocfs2/buffer_head_io.c
>> @@ -188,6 +188,7 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
>> block, int nr,
>>  int i, ignore_cache = 0;
>>  struct buffer_head *bh;
>>  struct super_block *sb = ocfs2_metadata_cache_get_super(ci);
>> +int new_bh = 0;
>>  
>>  trace_ocfs2_read_blocks_begin(ci, (unsigned long long)block, nr, flags);
>>  
>> @@ -213,6 +214,18 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, 
>> u64 block, int nr,
>>  goto bail;
>>  }
>>  
>> +/* Use below trick to check if all bhs are NULL or assigned.
>> + * Basically, we hope all bhs are consistent so that we can
>> + * handle exception easily.
>> + */
>> +new_bh = (bhs[0] == NULL);
>> +for (i = 1 ; i < nr ; i++) {
>> +if ((new_bh && bhs[i]) || (!new_bh && !bhs[i])) {
>> +WARN(1, "Not all bhs are consistent\n");
>> +break;
>> +}
>> +}
>> +
>>  ocfs2_metadata_cache_io_lock(ci);
>>  for (i = 0 ; i < nr ; i++) {
>>  if (bhs[i] == NULL) {
>> @@ -324,8 +337,10 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, 
>> u64 block, int nr,
>>  if (!(flags & OCFS2_BH_READAHEAD)) {
>>  if (status) {
>>  /* Clear the rest of the buffers on error */
>> -put_bh(bh);
>> -bhs[i] = NULL;
>> +if (new_bh) {
>> +put_bh(bh);
>> +bhs[i] = NULL;
>> +}
> 
> Since we assume caller has to pass either all NULL or all non-NULL,
> here we will only put bhs that were allocated internally. Am I missing something?
I think this branch could still put a bh that was allocated externally, since
'new_bh' only means bhs[0] was allocated internally. So this branch seems
inappropriate.
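
One way around this (just an idea, untested) would be to remember per entry
whether this call allocated the bh itself, and only put those entries on
failure; 'allocated_here' below is a new hypothetical mask, not something in
the current code, and it assumes nr <= BITS_PER_LONG:

	unsigned long allocated_here = 0;	/* bit i set => we allocated bhs[i] */
	...
	if (bhs[i] == NULL) {
		bhs[i] = sb_getblk(osb->sb, block++);
		allocated_here |= 1UL << i;
	}
	...
	if (status && (allocated_here & (1UL << i))) {
		put_bh(bhs[i]);
		bhs[i] = NULL;
	}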

thanks,
Jun
> 
> Thanks,
> Joseph
> 
>>  continue;
>>  }
>>  /* We know this can't have changed as we hold the
>> @@ -342,8 +357,10 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, 
>> u64 block, int nr,
>>   * for this bh as it's not marked locally
>>   * uptodate. */
>>  status = -EIO;
>> -put_bh(bh);
>> -bhs[i] = NULL;
>> +if (new_bh) {
>> +put_bh(bh);
>> +bhs[i] = NULL;
>> +}
>>  continue;
>>  }
>>  
>> @@ -355,8 +372,10 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, 
>> u64 block, int nr,
>>  clear_buffer_needs_validate(bh);
>>  status = validate(sb, bh);
>>  if (status) {
>> -put_bh(bh);
>> -bhs[i] = NULL;
>> +if (new_bh) {
>> +put_bh(bh);
>> +bhs[i] = NULL;
>> +}
>>  continue;
>>  }
>>  }
>>
> .
> 



Re: [Ocfs2-devel] [PATCH] ocfs2: don't evaluate buffer head to NULL managed by caller

2018-03-29 Thread piaojun
Hi Changwei,

On 2018/3/29 10:06, Changwei Ge wrote:
> ocfs2_read_blocks() is used to read several blocks from disk.
> Currently, the input argument *bhs* can be NULL or NOT. It depends on
> the caller's behavior. If the function fails in reading blocks from
> disk, the corresponding bh will be assigned to NULL and put.
> 
> Obviously, above process for non-NULL input bh is not appropriate.
> Because the caller doesn't even know its bhs are put and re-assigned.
> 
> If buffer head is managed by caller, ocfs2_read_blocks should not
> evaluate it to NULL. It will cause caller accessing illegal memory,
> thus crash.
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/buffer_head_io.c | 31 +--
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
> index d9ebe11..17329b6 100644
> --- a/fs/ocfs2/buffer_head_io.c
> +++ b/fs/ocfs2/buffer_head_io.c
> @@ -188,6 +188,7 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   int i, ignore_cache = 0;
>   struct buffer_head *bh;
>   struct super_block *sb = ocfs2_metadata_cache_get_super(ci);
> + int new_bh = 0;
>  
>   trace_ocfs2_read_blocks_begin(ci, (unsigned long long)block, nr, flags);
>  
> @@ -213,6 +214,18 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   goto bail;
>   }
>  
> + /* Use below trick to check if all bhs are NULL or assigned.
> +  * Basically, we hope all bhs are consistent so that we can
> +  * handle exception easily.
> +  */
> + new_bh = (bhs[0] == NULL);
> + for (i = 1 ; i < nr ; i++) {
> + if ((new_bh && bhs[i]) || (!new_bh && !bhs[i])) {
> + WARN(1, "Not all bhs are consistent\n");
> + break;
> + }
> + }
> +
>   ocfs2_metadata_cache_io_lock(ci);
>   for (i = 0 ; i < nr ; i++) {
>   if (bhs[i] == NULL) {
> @@ -324,8 +337,10 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   if (!(flags & OCFS2_BH_READAHEAD)) {
>   if (status) {
>   /* Clear the rest of the buffers on error */
> - put_bh(bh);
> - bhs[i] = NULL;
> + if (new_bh) {
> + put_bh(bh);
> + bhs[i] = NULL;
> + }
>   continue;
>   }
>   /* We know this can't have changed as we hold the
> @@ -342,8 +357,10 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>* for this bh as it's not marked locally
>* uptodate. */
>   status = -EIO;
> - put_bh(bh);
> - bhs[i] = NULL;
> + if (new_bh) {
> + put_bh(bh);
> + bhs[i] = NULL;
> + }
How can we be sure 'bhs[i]' is not allocated by the caller just from 'new_bh'?
'new_bh' == 1 only means 'bhs[0]' was allocated by ocfs2_read_blocks(), so
only in that case should we put it here, right?

thanks,
Jun
>   continue;
>   }
>  
> @@ -355,8 +372,10 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   clear_buffer_needs_validate(bh);
>   status = validate(sb, bh);
>   if (status) {
> - put_bh(bh);
> - bhs[i] = NULL;
> + if (new_bh) {
> + put_bh(bh);
> + bhs[i] = NULL;
> + }
>   continue;
>   }
>   }
> 



Re: [Ocfs2-devel] [PATCH] ocfs2/o2hb: check len for bio_add_page() to avoid submitting incorrect bio

2018-03-28 Thread piaojun
Hi Changwei and Joseph,

EIO sounds more reasonable, thanks a lot for your suggestions, and I will
send patch v2 later.

thanks,
Jun

On 2018/3/29 9:09, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/3/28 17:51, Joseph Qi wrote:
>>
>>
>> On 18/3/28 15:02, piaojun wrote:
>>> Hi Joseph,
>>>
>>> On 2018/3/28 12:58, Joseph Qi wrote:
>>>>
>>>>
>>>> On 18/3/28 11:50, piaojun wrote:
>>>>> We need to check len for bio_add_page() to make sure the bio has been set up
>>>>> correctly, otherwise we may submit incorrect data to the device.
>>>>>
>>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>>>>> ---
>>>>>   fs/ocfs2/cluster/heartbeat.c | 11 ++-
>>>>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>>>> index ea8c551..43ad79f 100644
>>>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>>>> @@ -570,7 +570,16 @@ static struct bio *o2hb_setup_one_bio(struct 
>>>>> o2hb_region *reg,
>>>>>current_page, vec_len, vec_start);
>>>>>
>>>>>   len = bio_add_page(bio, page, vec_len, vec_start);
>>>>> - if (len != vec_len) break;
>>>>> + if (len != vec_len) {
>>>>> + mlog(ML_ERROR, "Adding page[%d] to bio failed, "
>>>>> +  "page %p, len %d, vec_len %u, vec_start %u, "
>>>>> +  "bi_sector %llu\n", current_page, page, len,
>>>>> +  vec_len, vec_start,
>>>>> +  (unsigned long long)bio->bi_iter.bi_sector);
>>>>> + bio_put(bio);
>>>>> + bio = ERR_PTR(-EFAULT);
>>>>
>>>> IMO, EFAULT is not an appropriate error code here.
>>>> If __bio_add_page returns 0, some are caused by bio checking failed.
>>>> Also I've noticed that several other callers just use ENOMEM, so I think
>>>> EINVAL or ENOMEM may be better.
>>>
>>> __bio_add_page has been deleted in patch c66a14d07c13, and I notice that
>>> other callers always use -EFAULT or -EIO. I'm afraid we are not working from
>>> the same kernel source.
>>>
>>
>> Oops... Yes, I was looking an old kernel...
>> EIO sounds reasonable, but I don't know why EFAULT since it means "Bad 
>> address".
> 
> I agree with Joseph that EFAULT seems unreasonable for this exceptional case.
> But your trick looks good to me.
> After applying a more appropriate error number, please feel free to add my:
> Reviewed-by: Changwei Ge <ge.chang...@h3c.com>
> 
> Thanks,
> Changwei
> 
> 
>>
>> Thanks,
>> Joseph
>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2/o2hb: check len for bio_add_page() to avoid submitting incorrect bio

2018-03-27 Thread piaojun
We need to check len for bio_add_page() to make sure the bio has been set up
correctly, otherwise we may submit incorrect data to the device.

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/cluster/heartbeat.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index ea8c551..43ad79f 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -570,7 +570,16 @@ static struct bio *o2hb_setup_one_bio(struct o2hb_region 
*reg,
 current_page, vec_len, vec_start);

len = bio_add_page(bio, page, vec_len, vec_start);
-   if (len != vec_len) break;
+   if (len != vec_len) {
+   mlog(ML_ERROR, "Adding page[%d] to bio failed, "
+"page %p, len %d, vec_len %u, vec_start %u, "
+"bi_sector %llu\n", current_page, page, len,
+vec_len, vec_start,
+(unsigned long long)bio->bi_iter.bi_sector);
+   bio_put(bio);
+   bio = ERR_PTR(-EFAULT);
+   return bio;
+   }

cs += vec_len / (PAGE_SIZE/spp);
vec_start = 0;
-- 

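A note on the caller side: with o2hb_setup_one_bio() handing back an ERR_PTR()
on failure, its callers must check IS_ERR() before submitting. A rough fragment
of that pattern (illustrative only; the exact o2hb_read_slots() code, argument
list and label may differ):

	bio = o2hb_setup_one_bio(reg, &wc, &current_slot, max_slots,
				 REQ_OP_READ, 0);
	if (IS_ERR(bio)) {
		status = PTR_ERR(bio);
		mlog_errno(status);
		goto bail_and_wait;	/* hypothetical error label */
	}
	/* only submit the bio once we know it was built correctly */
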
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH 2/2] ocfs2/dlm: clean up unused stack variable in dlm_do_local_ast()

2018-03-15 Thread piaojun


On 2018/3/15 20:24, Changwei Ge wrote:
> Signed-off-by: Changwei Ge 
Acked-by: Jun Piao 
> ---
>  fs/ocfs2/dlm/dlmast.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
> index fd6..39831fc 100644
> --- a/fs/ocfs2/dlm/dlmast.c
> +++ b/fs/ocfs2/dlm/dlmast.c
> @@ -224,14 +224,12 @@ void dlm_do_local_ast(struct dlm_ctxt *dlm, struct 
> dlm_lock_resource *res,
> struct dlm_lock *lock)
>  {
>   dlm_astlockfunc_t *fn;
> - struct dlm_lockstatus *lksb;
>  
>   mlog(0, "%s: res %.*s, lock %u:%llu, Local AST\n", dlm->name,
>res->lockname.len, res->lockname.name,
>dlm_get_lock_cookie_node(be64_to_cpu(lock->ml.cookie)),
>dlm_get_lock_cookie_seq(be64_to_cpu(lock->ml.cookie)));
>  
> - lksb = lock->lksb;
>   fn = lock->ast;
>   BUG_ON(lock->ml.node != dlm->node_num);
>  
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH 1/2] ocfs2/dlm: clean up unused argument for dlm_destroy_recovery_area()

2018-03-15 Thread piaojun


On 2018/3/15 20:24, Changwei Ge wrote:
> Signed-off-by: Changwei Ge 
Acked-by: Jun Piao 
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index ec8f758..b36808c 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -62,7 +62,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 
> dead_node);
>  static int dlm_init_recovery_area(struct dlm_ctxt *dlm, u8 dead_node);
>  static int dlm_request_all_locks(struct dlm_ctxt *dlm,
>u8 request_from, u8 dead_node);
> -static void dlm_destroy_recovery_area(struct dlm_ctxt *dlm, u8 dead_node);
> +static void dlm_destroy_recovery_area(struct dlm_ctxt *dlm);
>  
>  static inline int dlm_num_locks_in_lockres(struct dlm_lock_resource *res);
>  static void dlm_init_migratable_lockres(struct dlm_migratable_lockres *mres,
> @@ -739,7 +739,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 
> dead_node)
>   }
>  
>   if (destroy)
> - dlm_destroy_recovery_area(dlm, dead_node);
> + dlm_destroy_recovery_area(dlm);
>  
>   return status;
>  }
> @@ -764,7 +764,7 @@ static int dlm_init_recovery_area(struct dlm_ctxt *dlm, 
> u8 dead_node)
>  
>   ndata = kzalloc(sizeof(*ndata), GFP_NOFS);
>   if (!ndata) {
> - dlm_destroy_recovery_area(dlm, dead_node);
> + dlm_destroy_recovery_area(dlm);
>   return -ENOMEM;
>   }
>   ndata->node_num = num;
> @@ -778,7 +778,7 @@ static int dlm_init_recovery_area(struct dlm_ctxt *dlm, 
> u8 dead_node)
>   return 0;
>  }
>  
> -static void dlm_destroy_recovery_area(struct dlm_ctxt *dlm, u8 dead_node)
> +static void dlm_destroy_recovery_area(struct dlm_ctxt *dlm)
>  {
>   struct dlm_reco_node_data *ndata, *next;
>   LIST_HEAD(tmplist);
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lock resources

2018-03-15 Thread piaojun
Wait for dlm recovery to finish when migrating all lock resources, in case
a new lock resource is left behind after leaving the dlm domain. The leftover
lock resource will cause other nodes to BUG.

  NodeA   NodeBNodeC

umount:
  dlm_unregister_domain()
dlm_migrate_all_locks()

 NodeB down

do recovery for NodeB
and collect a new lockres
form other live nodes:

  dlm_do_recovery
dlm_remaster_locks
  dlm_request_all_locks:

  dlm_mig_lockres_handler
dlm_new_lockres
  __dlm_insert_lockres

at last NodeA become the
master of the new lockres
and leave domain:
  dlm_leave_domain()

  mount:
dlm_join_domain()

  touch file and request
  for the owner of the new
  lockres, but all the
  other nodes said 'NO',
  so NodeC decide to be
  the owner, and send do
  assert msg to other
  nodes:
  dlmlock()
dlm_get_lock_resource()
  dlm_do_assert_master()

  other nodes receive the msg
  and found two masters exist.
  at last cause BUG in
  dlm_assert_master_handler()
  -->BUG();

Fixes: bc9838c4d44a ("dlm: allow dlm do recovery during shutdown")

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/dlm/dlmcommon.h   |  1 +
 fs/ocfs2/dlm/dlmdomain.c   | 15 +++
 fs/ocfs2/dlm/dlmrecovery.c | 13 ++---
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index 953c200..d06e27e 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -140,6 +140,7 @@ struct dlm_ctxt
u8 node_num;
u32 key;
u8  joining_node;
+   u8 migrate_done; /* set to 1 means node has migrated all lock resources 
*/
wait_queue_head_t dlm_join_events;
unsigned long live_nodes_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
unsigned long domain_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index 25b76f0..425081b 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -461,6 +461,19 @@ static int dlm_migrate_all_locks(struct dlm_ctxt *dlm)
cond_resched_lock(&dlm->spinlock);
num += n;
}
+
+   if (!num) {
+   if (dlm->reco.state & DLM_RECO_STATE_ACTIVE) {
+   mlog(0, "%s: perhaps there are more lock resources "
+"need to be migrated after dlm recovery\n", 
dlm->name);
+   ret = -EAGAIN;
+   } else {
+   mlog(0, "%s: we won't do dlm recovery after migrating "
+"all lock resources\n", dlm->name);
+   dlm->migrate_done = 1;
+   }
+   }
+
spin_unlock(&dlm->spinlock);
wake_up(&dlm->dlm_thread_wq);

@@ -2038,6 +2051,8 @@ static struct dlm_ctxt *dlm_alloc_ctxt(const char *domain,
dlm->joining_node = DLM_LOCK_RES_OWNER_UNKNOWN;
init_waitqueue_head(&dlm->dlm_join_events);

+   dlm->migrate_done = 0;
+
dlm->reco.new_master = O2NM_INVALID_NODE_NUM;
dlm->reco.dead_node = O2NM_INVALID_NODE_NUM;

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 505ab42..d4336a5 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -423,12 +423,11 @@ void dlm_wait_for_recovery(struct dlm_ctxt *dlm)

 static void dlm_begin_recovery(struct dlm_ctxt *dlm)
 {
-   spin_lock(&dlm->spinlock);
+   assert_spin_locked(&dlm->spinlock);
BUG_ON(dlm->reco.state & DLM_RECO_STATE_ACTIVE);
printk(KERN_NOTICE "o2dlm: Begin recovery on domain %s for node %u\n",
   dlm->name, dlm->reco.dead_node);
dlm->reco.state |= DLM_RECO_STATE_ACTIVE;
-   spin_unlock(&dlm->spinlock);
 }

 static void dlm_end_recovery(struct dlm_ctxt *dlm)
@@ -456,6 +455,13 @@ static int dlm_do_recovery(struct dlm_ctxt *dlm)

spin_lock(&dlm->spinlock);

+   if (dlm->migrate_done) {
+   mlog(0, "%s: no need do recovery after migrating all "
+

[Ocfs2-devel] [PATCH v3] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-04 Thread piaojun
We should not handle migrate lockres if we are already in
'DLM_CTXT_IN_SHUTDOWN', as that will cause a lockres to remain after leaving
the dlm domain. In the end other nodes will get stuck in an infinite loop
when requesting the lock from us.

The problem is caused by concurrent umount between nodes. Before
receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
migrate target. So N2 will continue sending the lockres to N1 even though
N1 has left the domain.

N1 N2 (owner)
   touch file

access the file,
and get pr lock

   begin leave domain and
   pick up N1 as new owner

begin leave domain and
migrate all lockres done

   begin migrate lockres to N1

end leave domain, but
the lockres left
unexpectedly, because
migrate task has passed

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
Reviewed-by: Joseph Qi 
---
 fs/ocfs2/dlm/dlmdomain.c   | 14 --
 fs/ocfs2/dlm/dlmdomain.h   | 25 -
 fs/ocfs2/dlm/dlmrecovery.c |  9 +
 3 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index e1fea14..25b76f0 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -675,20 +675,6 @@ static void dlm_leave_domain(struct dlm_ctxt *dlm)
spin_unlock(&dlm->spinlock);
 }

-int dlm_shutting_down(struct dlm_ctxt *dlm)
-{
-   int ret = 0;
-
-   spin_lock(&dlm_domain_lock);
-
-   if (dlm->dlm_state == DLM_CTXT_IN_SHUTDOWN)
-   ret = 1;
-
-   spin_unlock(&dlm_domain_lock);
-
-   return ret;
-}
-
 void dlm_unregister_domain(struct dlm_ctxt *dlm)
 {
int leave = 0;
diff --git a/fs/ocfs2/dlm/dlmdomain.h b/fs/ocfs2/dlm/dlmdomain.h
index fd6122a..8a92814 100644
--- a/fs/ocfs2/dlm/dlmdomain.h
+++ b/fs/ocfs2/dlm/dlmdomain.h
@@ -28,7 +28,30 @@
 extern spinlock_t dlm_domain_lock;
 extern struct list_head dlm_domains;

-int dlm_shutting_down(struct dlm_ctxt *dlm);
+static inline int dlm_joined(struct dlm_ctxt *dlm)
+{
+   int ret = 0;
+
+   spin_lock(&dlm_domain_lock);
+   if (dlm->dlm_state == DLM_CTXT_JOINED)
+   ret = 1;
+   spin_unlock(&dlm_domain_lock);
+
+   return ret;
+}
+
+static inline int dlm_shutting_down(struct dlm_ctxt *dlm)
+{
+   int ret = 0;
+
+   spin_lock(&dlm_domain_lock);
+   if (dlm->dlm_state == DLM_CTXT_IN_SHUTDOWN)
+   ret = 1;
+   spin_unlock(&dlm_domain_lock);
+
+   return ret;
+}
+
 void dlm_fire_domain_eviction_callbacks(struct dlm_ctxt *dlm,
int node_num);

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index ec8f758..505ab42 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1378,6 +1378,15 @@ int dlm_mig_lockres_handler(struct o2net_msg *msg, u32 
len, void *data,
if (!dlm_grab(dlm))
return -EINVAL;

+   if (!dlm_joined(dlm)) {
+   mlog(ML_ERROR, "Domain %s not joined! "
+ "lockres %.*s, master %u\n",
+ dlm->name, mres->lockname_len,
+ mres->lockname, mres->master);
+   dlm_put(dlm);
+   return -EINVAL;
+   }
+
BUG_ON(!(mres->flags & (DLM_MRES_RECOVERY|DLM_MRES_MIGRATION)));

real_master = mres->master;
-- 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-04 Thread piaojun
Hi Joseph,

On 2018/3/5 9:27, Joseph Qi wrote:
> 
> 
> On 18/3/3 08:45, piaojun wrote:
>> We should not handle migrate lockres if we are already in
>> 'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after leaving
>> dlm domain. At last other nodes will get stuck into infinite loop when
>> requesting lock from us.
>>
>> The problem is caused by concurrent umount between nodes. Before
>> receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
>> migrate target. So N2 will continue sending lockres to N1 even though N1
>> has left domain.
>>
>> N1 N2 (owner)
>>touch file
>>
>> access the file,
>> and get pr lock
>>
>>begin leave domain and
>>pick up N1 as new owner
>>
>> begin leave domain and
>> migrate all lockres done
>>
>>begin migrate lockres to N1
>>
>> end leave domain, but
>> the lockres left
>> unexpectedly, because
>> migrate task has passed
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com> ---
>>  fs/ocfs2/dlm/dlmdomain.c   | 14 ++
>>  fs/ocfs2/dlm/dlmdomain.h   |  1 +
>>  fs/ocfs2/dlm/dlmrecovery.c |  9 +
>>  3 files changed, 24 insertions(+)
>>
>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>> index e1fea14..3b7ec51 100644
>> --- a/fs/ocfs2/dlm/dlmdomain.c
>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>> @@ -675,6 +675,20 @@ static void dlm_leave_domain(struct dlm_ctxt *dlm)
>>  spin_unlock(&dlm->spinlock);
>>  }
>>
>> +int dlm_joined(struct dlm_ctxt *dlm)
> This helper can be static inline and placed into header.
Agree, and I decide move dlm_shutting_down() into dlmdomain.h together.

thanks,
Jun
> 
>> +{
>> +int ret = 0;
>> +
>> +spin_lock(&dlm_domain_lock);
>> +
> Delete blank line here.
> 
>> +if (dlm->dlm_state == DLM_CTXT_JOINED)
>> +ret = 1;
>> +
> Also here.
> 
> Except the above concern, it looks good to me.
> With they are fixed, feel free to add:
> 
> Reviewed-by: Joseph Qi <jiangqi...@gmail.com>
> 
>> +spin_unlock(&dlm_domain_lock);
>> +
>> +return ret;
>> +}
>> +
>>  int dlm_shutting_down(struct dlm_ctxt *dlm)
>>  {
>>  int ret = 0;
>> diff --git a/fs/ocfs2/dlm/dlmdomain.h b/fs/ocfs2/dlm/dlmdomain.h
>> index fd6122a..2f7f60b 100644
>> --- a/fs/ocfs2/dlm/dlmdomain.h
>> +++ b/fs/ocfs2/dlm/dlmdomain.h
>> @@ -28,6 +28,7 @@
>>  extern spinlock_t dlm_domain_lock;
>>  extern struct list_head dlm_domains;
>>
>> +int dlm_joined(struct dlm_ctxt *dlm);
>>  int dlm_shutting_down(struct dlm_ctxt *dlm);
>>  void dlm_fire_domain_eviction_callbacks(struct dlm_ctxt *dlm,
>>  int node_num);
>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>> index ec8f758..505ab42 100644
>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>> @@ -1378,6 +1378,15 @@ int dlm_mig_lockres_handler(struct o2net_msg *msg, 
>> u32 len, void *data,
>>  if (!dlm_grab(dlm))
>>  return -EINVAL;
>>
>> +if (!dlm_joined(dlm)) {
>> +mlog(ML_ERROR, "Domain %s not joined! "
>> +  "lockres %.*s, master %u\n",
>> +  dlm->name, mres->lockname_len,
>> +  mres->lockname, mres->master);
>> +dlm_put(dlm);
>> +return -EINVAL;
>> +}
>> +
>>  BUG_ON(!(mres->flags & (DLM_MRES_RECOVERY|DLM_MRES_MIGRATION)));
>>
>>  real_master = mres->master;
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH v2] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-02 Thread piaojun
We should not handle migrate lockres if we are already in
'DLM_CTXT_IN_SHUTDOWN', as that will cause a lockres to remain after leaving
the dlm domain. In the end other nodes will get stuck in an infinite loop
when requesting the lock from us.

The problem is caused by concurrent umount between nodes. Before
receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
migrate target. So N2 will continue sending the lockres to N1 even though
N1 has left the domain.

N1 N2 (owner)
   touch file

access the file,
and get pr lock

   begin leave domain and
   pick up N1 as new owner

begin leave domain and
migrate all lockres done

   begin migrate lockres to N1

end leave domain, but
the lockres left
unexpectedly, because
migrate task has passed

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/dlm/dlmdomain.c   | 14 ++
 fs/ocfs2/dlm/dlmdomain.h   |  1 +
 fs/ocfs2/dlm/dlmrecovery.c |  9 +
 3 files changed, 24 insertions(+)

diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index e1fea14..3b7ec51 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -675,6 +675,20 @@ static void dlm_leave_domain(struct dlm_ctxt *dlm)
spin_unlock(&dlm->spinlock);
 }

+int dlm_joined(struct dlm_ctxt *dlm)
+{
+   int ret = 0;
+
+   spin_lock(&dlm_domain_lock);
+
+   if (dlm->dlm_state == DLM_CTXT_JOINED)
+   ret = 1;
+
+   spin_unlock(&dlm_domain_lock);
+
+   return ret;
+}
+
 int dlm_shutting_down(struct dlm_ctxt *dlm)
 {
int ret = 0;
diff --git a/fs/ocfs2/dlm/dlmdomain.h b/fs/ocfs2/dlm/dlmdomain.h
index fd6122a..2f7f60b 100644
--- a/fs/ocfs2/dlm/dlmdomain.h
+++ b/fs/ocfs2/dlm/dlmdomain.h
@@ -28,6 +28,7 @@
 extern spinlock_t dlm_domain_lock;
 extern struct list_head dlm_domains;

+int dlm_joined(struct dlm_ctxt *dlm);
 int dlm_shutting_down(struct dlm_ctxt *dlm);
 void dlm_fire_domain_eviction_callbacks(struct dlm_ctxt *dlm,
int node_num);
diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index ec8f758..505ab42 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1378,6 +1378,15 @@ int dlm_mig_lockres_handler(struct o2net_msg *msg, u32 
len, void *data,
if (!dlm_grab(dlm))
return -EINVAL;

+   if (!dlm_joined(dlm)) {
+   mlog(ML_ERROR, "Domain %s not joined! "
+ "lockres %.*s, master %u\n",
+ dlm->name, mres->lockname_len,
+ mres->lockname, mres->master);
+   dlm_put(dlm);
+   return -EINVAL;
+   }
+
BUG_ON(!(mres->flags & (DLM_MRES_RECOVERY|DLM_MRES_MIGRATION)));

real_master = mres->master;
-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-02 Thread piaojun
Hi Andrew,

Thanks for your suggestion, I will give a more comprehensive changelog in
patch v2 later.

thanks,
Jun

On 2018/3/2 7:29, Andrew Morton wrote:
> On Thu, 1 Mar 2018 20:37:50 +0800 piaojun <piao...@huawei.com> wrote:
> 
>> Hi Changwei,
>>
>> Thanks for your quick reply, please see my comments below.
>>
>> On 2018/3/1 17:39, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/3/1 17:27, piaojun wrote:
>>>> We should not handle migrate lockres if we are already in
>>>> 'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after
>>>> leaving dlm domain. At last other nodes will get stuck into infinite
>>>> loop when requesting lock from us.
>>>>
>>>>  N1 N2 (owner)
>>>> touch file
>>>>
>>>> access the file,
>>>> and get pr lock
>>>>
>>>> umount
>>>>
>>>
>>> Before migrating all lock resources, N1 should have already sent 
>>> DLM_BEGIN_EXIT_DOMAIN_MSG in dlm_begin_exit_domain().
>>> N2 will set ->exit_domain_map later.
>>> So N2 can't take N1 as migration target.
>> Before receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as
>> the migrate target. So N2 will continue sending lockres to N1 even though
>> N1 has left domain. Sorry for the misunderstanding; I will give a
>> more detailed description.
>>
>> N1 N2 (owner)
>>touch file
>>
>> access the file,
>> and get pr lock
>>
>>begin leave domain and
>>pick up N1 as new owner
>>
>> begin leave domain and
>> migrate all lockres done
>>
>>begin migrate lockres to N1
>>
>> end leave domain, but
>> the lockres left
>> unexpectedly, because
>> migrate task has passed
> 
> If someone asked a question then this is a sign that the changelog was
> missing details.  So please do send along a v2 with a more
> comprehensive changelog.
> 
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-02 Thread piaojun
Hi Changwei,

On 2018/3/2 13:53, Changwei Ge wrote:
> On 2018/3/2 10:09, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/3/2 9:49, Changwei Ge wrote:
>>> Hi Jun,
>>> I still have some doubts about your problematic situation description .
>>> Please check my reply inline your sequence diagram.
>>>
>>> On 2018/3/1 20:38, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> Thanks for your quick reply, please see my comments below.
>>>>
>>>> On 2018/3/1 17:39, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> On 2018/3/1 17:27, piaojun wrote:
>>>>>> We should not handle migrate lockres if we are already in
>>>>>> 'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after
>>>>>> leaving dlm domain. At last other nodes will get stuck into infinite
>>>>>> loop when requesting lock from us.
>>>>>>
>>>>>>N1 N2 (owner)
>>>>>>   touch file
>>>>>>
>>>>>> access the file,
>>>>>> and get pr lock
>>>>>>
>>>>>> umount
>>>>>>
>>>>>
>>>>> Before migrating all lock resources, N1 should have already sent
>>>>> DLM_BEGIN_EXIT_DOMAIN_MSG in dlm_begin_exit_domain().
>>>>> N2 will set ->exit_domain_map later.
>>>>> So N2 can't take N1 as migration target.
>>>> Before receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as
>>>> the migrate target. So N2 will continue sending lockres to N1 even though
>>>> N1 has left domain. Sorry for the misunderstanding; I will give a
>>>> more detailed description.
>>>>
>>>>   N1 N2 (owner)
>>>>  touch file
>>>>
>>>> access the file,
>>>> and get pr lock
>>>>
>>>>  begin leave domain and
>>>>  pick up N1 as new owner
>>>>
>>>> begin leave domain and
>>> Here will clear N1 from N2's dlm->domain_map
>>>
>>>> migrate all lockres done
>>>>  begin migrate lockres to N1
>>> As N1 has been cleared, migrating for this lock resource can't continue.
>> N2 has picked up N1 as new owner in dlm_pick_migration_target() before
>> setting N1 in N2's dlm->exit_domain_map, so N2 will continue migrating
>> lock resource.
> 
> Do you mean that DLM_BEGIN_EXIT_DOMAIN_MSG arrives just before 
> dlm_send_one_lockres() in dlm_migrate_lockres() and after 
> dlm_mark_lockres_migrating()?
Yes, I meant that.

> If so, there is a race in which your problem can happen.
> 
> Well let's move to your code.
> 
>>>
>>> Besides,
>>> I wonder what kind of problem you encountered?
>>> A bug crash? A hang situation?
>>> It's better to share how the issue appears. So perhaps I can help to
>>> analyze. :)
>> I encountered a stuck case which described in my changelog.
>>
>> "
>> At last other nodes will get stuck into infinite loop when requsting lock
>> from us.
>> "
>>
>> thanks,
>> Jun
>>>
>>> Thanks,
>>> Changwei
>>>
>>>>
>>>> end leave domain, but
>>>> the lockres left
>>>> unexpectedly, because
>>>> migrate task has passed
>>>>
>>>> thanks,
>>>> Jun
>>>>>
>>>>> How can the scenario your changelog describing happen?
>>>>> Or if miss something?
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>>>> migrate all lockres
>>>>>>
>>>>>>   umount and migrate lockres to N1
>>>>>>
>>>>>> leave dlm domain, but
>>>>>> the lockres left
>>>>>> unexpectedly, because
>>>>>> migrate task has passed
>>>>>>
>>>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>>>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>>>>>> ---
>>>>>> fs/ocfs2/dlm/dlmdomain.c   | 14 ++
>>>>>> fs/ocfs2/dlm/dlmdomain.h   |  

Re: [Ocfs2-devel] [PATCH] Correct a comment error

2018-03-01 Thread piaojun
Hi Changwei,

On 2018/3/2 9:59, Changwei Ge wrote:
> Hi Jun,
> I think the comments for both two functions are OK.
> No need to rework them.
> As we know, ocfs2 lock name(lock id) are composed of several parts including 
> block number.
I looked through the comments involving 'lockid', and found 'lockid' is a
concept at the dlm level, so the ocfs2 level should not be aware of it.

thanks,
Jun
> 
> Thanks,
> Changw2ei
> 
> On 2018/3/1 20:58, piaojun wrote:
>> Hi Larry,
>>
>> There is the same mistake in ocfs2_reflink_inodes_lock(), could you help
>> fixing them all?
>>
>> thanks,
>> Jun
>>
>> On 2018/2/28 18:17, Larry Chen wrote:
>>> The function ocfs2_double_lock tries to lock the inode with lower
>>> blockid first, not lockid.
>>>
>>> Signed-off-by: Larry Chen <lc...@suse.com>
>>> ---
>>>   fs/ocfs2/namei.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
>>> index c801eddc4bf3..30d454de35a8 100644
>>> --- a/fs/ocfs2/namei.c
>>> +++ b/fs/ocfs2/namei.c
>>> @@ -1133,7 +1133,7 @@ static int ocfs2_double_lock(struct ocfs2_super *osb,
>>> if (*bh2)
>>> *bh2 = NULL;
>>>   
>>> -   /* we always want to lock the one with the lower lockid first.
>>> +   /* we always want to lock the one with the lower blockid first.
>>>  * and if they are nested, we lock ancestor first */
>>> if (oi1->ip_blkno != oi2->ip_blkno) {
>>> inode1_is_ancestor = ocfs2_check_if_ancestor(osb, oi2->ip_blkno,
>>>
>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-01 Thread piaojun
Hi Changwei,

On 2018/3/2 9:49, Changwei Ge wrote:
> Hi Jun,
> I still have some doubts about your problematic situation description .
> Please check my reply inline your sequence diagram.
> 
> On 2018/3/1 20:38, piaojun wrote:
>> Hi Changwei,
>>
>> Thanks for your quick reply, please see my comments below.
>>
>> On 2018/3/1 17:39, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/3/1 17:27, piaojun wrote:
>>>> We should not handle migrate lockres if we are already in
>>>> 'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after
>>>> leaving dlm domain. At last other nodes will get stuck into infinite
>>>> loop when requesting lock from us.
>>>>
>>>>   N1 N2 (owner)
>>>>  touch file
>>>>
>>>> access the file,
>>>> and get pr lock
>>>>
>>>> umount
>>>>
>>>
>>> Before migrating all lock resources, N1 should have already sent
>>> DLM_BEGIN_EXIT_DOMAIN_MSG in dlm_begin_exit_domain().
>>> N2 will set ->exit_domain_map later.
>>> So N2 can't take N1 as migration target.
>> Before receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as
>> the migrate target. So N2 will continue sending lockres to N1 even though
>> N1 has left domain. Sorry for the misunderstanding; I will give a
>> more detailed description.
>>
>>  N1 N2 (owner)
>> touch file
>>
>> access the file,
>> and get pr lock
>>
>> begin leave domain and
>> pick up N1 as new owner
>>
>> begin leave domain and
> Here will clear N1 from N2's dlm->domain_map
> 
>> migrate all lockres done
>> begin migrate lockres to N1
> As N1 has been cleared, migrating for this lock resource can't continue.
N2 has picked up N1 as new owner in dlm_pick_migration_target() before
setting N1 in N2's dlm->exit_domain_map, so N2 will continue migrating
lock resource.
> 
> Besides,
> I wonder what kind of problem you encountered?
> A bug crash? A hang situation?
> It's better to share how the issue appears. So perhaps I can help to analyze. :)
I encountered a stuck case which described in my changelog.

"
At last other nodes will get stuck into infinite loop when requsting lock
from us.
"

thanks,
Jun
> 
> Thanks,
> Changwei
> 
>>
>> end leave domain, but
>> the lockres left
>> unexpectedly, because
>> migrate task has passed
>>
>> thanks,
>> Jun
>>>
>>> How can the scenario your changelog describing happen?
>>> Or if miss something?
>>>
>>> Thanks,
>>> Changwei
>>>
>>>> migrate all lockres
>>>>
>>>>  umount and migrate lockres to N1
>>>>
>>>> leave dlm domain, but
>>>> the lockres left
>>>> unexpectedly, because
>>>> migrate task has passed
>>>>
>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>>>> ---
>>>>fs/ocfs2/dlm/dlmdomain.c   | 14 ++
>>>>fs/ocfs2/dlm/dlmdomain.h   |  1 +
>>>>fs/ocfs2/dlm/dlmrecovery.c |  9 +
>>>>3 files changed, 24 insertions(+)
>>>>
>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>>>> index e1fea14..3b7ec51 100644
>>>> --- a/fs/ocfs2/dlm/dlmdomain.c
>>>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>>>> @@ -675,6 +675,20 @@ static void dlm_leave_domain(struct dlm_ctxt *dlm)
>>>>   spin_unlock(&dlm->spinlock);
>>>>}
>>>>
>>>> +int dlm_joined(struct dlm_ctxt *dlm)
>>>> +{
>>>> +  int ret = 0;
>>>> +
>>>> +  spin_lock(&dlm_domain_lock);
>>>> +
>>>> +  if (dlm->dlm_state == DLM_CTXT_JOINED)
>>>> +  ret = 1;
>>>> +
>>>> +  spin_unlock(&dlm_domain_lock);
>>>> +
>>>> +  return ret;
>>>> +}
>>>> +
>>>>int dlm_shutting_down(struct dlm_ctxt *dlm)
>>>>{
>>>>int ret = 0;
>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.h b/fs/ocfs2/dlm/dlmdomain.h
>>>> index fd6122a..2f7f60b 100644

Re: [Ocfs2-devel] [PATCH] Correct a comment error

2018-03-01 Thread piaojun
Hi Larry,

There is the same mistake in ocfs2_reflink_inodes_lock(), could you help
fixing them all?

thanks,
Jun

On 2018/2/28 18:17, Larry Chen wrote:
> The function ocfs2_double_lock tries to lock the inode with lower
> blockid first, not lockid.
> 
> Signed-off-by: Larry Chen 
> ---
>  fs/ocfs2/namei.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index c801eddc4bf3..30d454de35a8 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -1133,7 +1133,7 @@ static int ocfs2_double_lock(struct ocfs2_super *osb,
>   if (*bh2)
>   *bh2 = NULL;
>  
> - /* we always want to lock the one with the lower lockid first.
> + /* we always want to lock the one with the lower blockid first.
>* and if they are nested, we lock ancestor first */
>   if (oi1->ip_blkno != oi2->ip_blkno) {
>   inode1_is_ancestor = ocfs2_check_if_ancestor(osb, oi2->ip_blkno,
> 

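A side note on why the ordering matters: taking two locks in an order derived
from a stable key (here the block number) is the standard way to avoid an ABBA
deadlock when two tasks lock the same pair of inodes. A generic sketch of the
idiom, not the actual ocfs2 code:

#include <linux/kernel.h>
#include <linux/mutex.h>

struct keyed_lock {
	u64 key;		/* e.g. a block number */
	struct mutex lock;
};

/* Always take the lower key first so every path agrees on the order. */
static void lock_pair(struct keyed_lock *a, struct keyed_lock *b)
{
	if (a->key > b->key)
		swap(a, b);

	mutex_lock(&a->lock);
	if (a != b)
		mutex_lock(&b->lock);
}
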
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-01 Thread piaojun
Hi Changwei,

Thanks for your quick reply, please see my comments below.

On 2018/3/1 17:39, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/3/1 17:27, piaojun wrote:
>> We should not handle migrate lockres if we are already in
>> 'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after
>> leaving dlm domain. At last other nodes will get stuck into infinite
>> loop when requsting lock from us.
>>
>>  N1 N2 (owner)
>> touch file
>>
>> access the file,
>> and get pr lock
>>
>> umount
>>
> 
> Before migrating all lock resources, N1 should have already sent 
> DLM_BEGIN_EXIT_DOMAIN_MSG in dlm_begin_exit_domain().
> N2 will set ->exit_domain_map later.
> So N2 can't take N1 as migration target.
Before receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as
the migrate target. So N2 will continue sending the lockres to N1 even though
N1 has left the domain. Sorry for the misunderstanding; I will give a
more detailed description.

N1 N2 (owner)
   touch file

access the file,
and get pr lock

   begin leave domain and
   pick up N1 as new owner

begin leave domain and
migrate all lockres done

   begin migrate lockres to N1

end leave domain, but
the lockres left
unexpectedly, because
migrate task has passed

thanks,
Jun
> 
> How can the scenario your changelog describing happen?
> Or if miss something?
> 
> Thanks,
> Changwei
> 
>> migrate all lockres
>>
>> umount and migrate lockres to N1
>>
>> leave dlm domain, but
>> the lockres left
>> unexpectedly, because
>> migrate task has passed
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>> ---
>>   fs/ocfs2/dlm/dlmdomain.c   | 14 ++
>>   fs/ocfs2/dlm/dlmdomain.h   |  1 +
>>   fs/ocfs2/dlm/dlmrecovery.c |  9 +
>>   3 files changed, 24 insertions(+)
>>
>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>> index e1fea14..3b7ec51 100644
>> --- a/fs/ocfs2/dlm/dlmdomain.c
>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>> @@ -675,6 +675,20 @@ static void dlm_leave_domain(struct dlm_ctxt *dlm)
>>  spin_unlock(&dlm->spinlock);
>>   }
>>
>> +int dlm_joined(struct dlm_ctxt *dlm)
>> +{
>> +int ret = 0;
>> +
>> +spin_lock(&dlm_domain_lock);
>> +
>> +if (dlm->dlm_state == DLM_CTXT_JOINED)
>> +ret = 1;
>> +
>> +spin_unlock(&dlm_domain_lock);
>> +
>> +return ret;
>> +}
>> +
>>   int dlm_shutting_down(struct dlm_ctxt *dlm)
>>   {
>>  int ret = 0;
>> diff --git a/fs/ocfs2/dlm/dlmdomain.h b/fs/ocfs2/dlm/dlmdomain.h
>> index fd6122a..2f7f60b 100644
>> --- a/fs/ocfs2/dlm/dlmdomain.h
>> +++ b/fs/ocfs2/dlm/dlmdomain.h
>> @@ -28,6 +28,7 @@
>>   extern spinlock_t dlm_domain_lock;
>>   extern struct list_head dlm_domains;
>>
>> +int dlm_joined(struct dlm_ctxt *dlm);
>>   int dlm_shutting_down(struct dlm_ctxt *dlm);
>>   void dlm_fire_domain_eviction_callbacks(struct dlm_ctxt *dlm,
>>  int node_num);
>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>> index ec8f758..9b3bc66 100644
>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>> @@ -1378,6 +1378,15 @@ int dlm_mig_lockres_handler(struct o2net_msg *msg, 
>> u32 len, void *data,
>>  if (!dlm_grab(dlm))
>>  return -EINVAL;
>>
>> +if (!dlm_joined(dlm)) {
>> +mlog(ML_ERROR, "Domain %s not joined! "
>> +"lockres %.*s, master %u\n",
>> +dlm->name, mres->lockname_len,
>> +mres->lockname, mres->master);
>> +dlm_put(dlm);
>> +return -EINVAL;
>> +}
>> +
>>  BUG_ON(!(mres->flags & (DLM_MRES_RECOVERY|DLM_MRES_MIGRATION)));
>>
>>  real_master = mres->master;
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2/dlm: don't handle migrate lockres if already in shutdown

2018-03-01 Thread piaojun
We should not handle migrate lockres if we are already in
'DLM_CTXT_IN_SHUTDOWN', as that will cause a lockres to remain after
leaving the dlm domain. In the end other nodes will get stuck in an
infinite loop when requesting the lock from us.

N1 N2 (owner)
   touch file

access the file,
and get pr lock

umount

migrate all lockres

   umount and migrate lockres to N1

leave dlm domain, but
the lockres left
unexpectedly, because
migrate task has passed

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/dlm/dlmdomain.c   | 14 ++
 fs/ocfs2/dlm/dlmdomain.h   |  1 +
 fs/ocfs2/dlm/dlmrecovery.c |  9 +
 3 files changed, 24 insertions(+)

diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index e1fea14..3b7ec51 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -675,6 +675,20 @@ static void dlm_leave_domain(struct dlm_ctxt *dlm)
spin_unlock(&dlm->spinlock);
 }

+int dlm_joined(struct dlm_ctxt *dlm)
+{
+   int ret = 0;
+
+   spin_lock(&dlm_domain_lock);
+
+   if (dlm->dlm_state == DLM_CTXT_JOINED)
+   ret = 1;
+
+   spin_unlock(&dlm_domain_lock);
+
+   return ret;
+}
+
 int dlm_shutting_down(struct dlm_ctxt *dlm)
 {
int ret = 0;
diff --git a/fs/ocfs2/dlm/dlmdomain.h b/fs/ocfs2/dlm/dlmdomain.h
index fd6122a..2f7f60b 100644
--- a/fs/ocfs2/dlm/dlmdomain.h
+++ b/fs/ocfs2/dlm/dlmdomain.h
@@ -28,6 +28,7 @@
 extern spinlock_t dlm_domain_lock;
 extern struct list_head dlm_domains;

+int dlm_joined(struct dlm_ctxt *dlm);
 int dlm_shutting_down(struct dlm_ctxt *dlm);
 void dlm_fire_domain_eviction_callbacks(struct dlm_ctxt *dlm,
int node_num);
diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index ec8f758..9b3bc66 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1378,6 +1378,15 @@ int dlm_mig_lockres_handler(struct o2net_msg *msg, u32 
len, void *data,
if (!dlm_grab(dlm))
return -EINVAL;

+   if (!dlm_joined(dlm)) {
+   mlog(ML_ERROR, "Domain %s not joined! "
+   "lockres %.*s, master %u\n",
+   dlm->name, mres->lockname_len,
+   mres->lockname, mres->master);
+   dlm_put(dlm);
+   return -EINVAL;
+   }
+
BUG_ON(!(mres->flags & (DLM_MRES_RECOVERY|DLM_MRES_MIGRATION)));

real_master = mres->master;
-- 

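One detail worth noting in the hunk above: dlm_grab() takes a reference on
the dlm context, so any early return added after it has to pair with
dlm_put(), which the patch does. The general shape of that pattern
(illustrative fragment only):

	if (!dlm_grab(dlm))
		return -EINVAL;		/* no reference was taken */

	if (!dlm_joined(dlm)) {
		dlm_put(dlm);		/* drop the reference from dlm_grab() */
		return -EINVAL;
	}

	/* ... normal handling, then a final dlm_put() on the way out ... */
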
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: clean unrelated comment

2018-02-23 Thread piaojun
Hi Changwei,

On 2018/2/23 17:55, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/2/23 17:13, piaojun wrote:
>> Hi changwei,
>>
>> On 2018/2/23 15:30, ge.chang...@h3c.com wrote:
>>> From: Changwei Ge <ge.chang...@h3c.com>
>> This line seems unnecessary; the others look good to me.
> I used git send-email, thus this line was added by parameter --from.
> I suppose you could ignore it if it seems strange to you:-)
If this won't affect the patch format, you could add:
Reviewed-by: Jun Piao <piao...@huawei.com>

thanks,
Jun
> 
> Thanks,
> Changwei
> 
>>
>> thanks,
>> Jun
>>>
>>> Obviously, the comment before dlm_do_local_recovery_cleanup() has
>>> nothing to do with it. So remove it.
>>>
>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>> ---
>>>   fs/ocfs2/dlm/dlmrecovery.c | 7 ---
>>>   1 file changed, 7 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>> index ec8f758..65f0c54 100644
>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>> @@ -2331,13 +2331,6 @@ static void dlm_free_dead_locks(struct dlm_ctxt *dlm,
>>> __dlm_dirty_lockres(dlm, res);
>>>   }
>>>   
>>> -/* if this node is the recovery master, and there are no
>>> - * locks for a given lockres owned by this node that are in
>>> - * either PR or EX mode, zero out the lvb before requesting.
>>> - *
>>> - */
>>> -
>>> -
>>>   static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 
>>> dead_node)
>>>   {
>>> struct dlm_lock_resource *res;
>>>
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: clean unrelated comment

2018-02-23 Thread piaojun
Hi changwei,

On 2018/2/23 15:30, ge.chang...@h3c.com wrote:
> From: Changwei Ge 
This line seems unnecessary; the others look good to me.

thanks,
Jun
> 
> Obviously, the comment before dlm_do_local_recovery_cleanup() has
> nothing to do with it. So remove it.
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 7 ---
>  1 file changed, 7 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index ec8f758..65f0c54 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2331,13 +2331,6 @@ static void dlm_free_dead_locks(struct dlm_ctxt *dlm,
>   __dlm_dirty_lockres(dlm, res);
>  }
>  
> -/* if this node is the recovery master, and there are no
> - * locks for a given lockres owned by this node that are in
> - * either PR or EX mode, zero out the lvb before requesting.
> - *
> - */
> -
> -
>  static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
>  {
>   struct dlm_lock_resource *res;
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: clean up some unused function declaration

2018-02-08 Thread piaojun
Clean up some unused function declarations in dlmcommon.h

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/dlm/dlmcommon.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index e9f3705..953c200 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -960,13 +960,10 @@ static inline int dlm_send_proxy_ast(struct dlm_ctxt *dlm,
 void dlm_print_one_lock_resource(struct dlm_lock_resource *res);
 void __dlm_print_one_lock_resource(struct dlm_lock_resource *res);

-u8 dlm_nm_this_node(struct dlm_ctxt *dlm);
 void dlm_kick_thread(struct dlm_ctxt *dlm, struct dlm_lock_resource *res);
 void __dlm_dirty_lockres(struct dlm_ctxt *dlm, struct dlm_lock_resource *res);


-int dlm_nm_init(struct dlm_ctxt *dlm);
-int dlm_heartbeat_init(struct dlm_ctxt *dlm);
 void dlm_hb_node_down_cb(struct o2nm_node *node, int idx, void *data);
 void dlm_hb_node_up_cb(struct o2nm_node *node, int idx, void *data);

-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH 1/2] ocfs2: use 'oi' instead of 'OCFS2_I()'

2018-02-06 Thread piaojun
Hi Andrew,

Could you help review my patch set and apply it?

Thanks a lot,
Jun

On 2018/1/30 15:38, piaojun wrote:
> We could use 'oi' instead of 'OCFS2_I()' to make code more elegant.
> 
> Signed-off-by: Jun Piao <piao...@huawei.com>
> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
> ---
>  fs/ocfs2/alloc.c| 2 +-
>  fs/ocfs2/aops.c | 2 +-
>  fs/ocfs2/file.c | 6 +++---
>  fs/ocfs2/inode.c| 2 +-
>  fs/ocfs2/namei.c| 6 +++---
>  fs/ocfs2/refcounttree.c | 6 +++---
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> index ab5105f..8ee4bd8 100644
> --- a/fs/ocfs2/alloc.c
> +++ b/fs/ocfs2/alloc.c
> @@ -6940,7 +6940,7 @@ int ocfs2_convert_inline_data_to_extents(struct inode 
> *inode,
>   goto out_commit;
>   did_quota = 1;
> 
> - data_ac->ac_resv = &OCFS2_I(inode)->ip_la_data_resv;
> + data_ac->ac_resv = &oi->ip_la_data_resv;
> 
>   ret = ocfs2_claim_clusters(handle, data_ac, 1, &bit_off,
>  &num);
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index d151632..4dae836 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -346,7 +346,7 @@ static int ocfs2_readpage(struct file *file, struct page 
> *page)
>   unlock = 0;
> 
>  out_alloc:
> - up_read(&OCFS2_I(inode)->ip_alloc_sem);
> + up_read(&oi->ip_alloc_sem);
>  out_inode_unlock:
>   ocfs2_inode_unlock(inode, 0);
>  out:
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index dc455d4..2188af4 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -101,7 +101,7 @@ static int ocfs2_file_open(struct inode *inode, struct 
> file *file)
>   struct ocfs2_inode_info *oi = OCFS2_I(inode);
> 
>   trace_ocfs2_file_open(inode, file, file->f_path.dentry,
> -   (unsigned long long)OCFS2_I(inode)->ip_blkno,
> +   (unsigned long long)oi->ip_blkno,
> file->f_path.dentry->d_name.len,
> file->f_path.dentry->d_name.name, mode);
> 
> @@ -116,7 +116,7 @@ static int ocfs2_file_open(struct inode *inode, struct 
> file *file)
>   /* Check that the inode hasn't been wiped from disk by another
>* node. If it hasn't then we're safe as long as we hold the
>* spin lock until our increment of open count. */
> - if (OCFS2_I(inode)->ip_flags & OCFS2_INODE_DELETED) {
> + if (oi->ip_flags & OCFS2_INODE_DELETED) {
>   spin_unlock(&oi->ip_lock);
> 
>   status = -ENOENT;
> @@ -188,7 +188,7 @@ static int ocfs2_sync_file(struct file *file, loff_t 
> start, loff_t end,
>   bool needs_barrier = false;
> 
>   trace_ocfs2_sync_file(inode, file, file->f_path.dentry,
> -   OCFS2_I(inode)->ip_blkno,
> +   oi->ip_blkno,
> file->f_path.dentry->d_name.len,
> file->f_path.dentry->d_name.name,
> (unsigned long long)datasync);
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index 1a1e007..2c48395 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -1159,7 +1159,7 @@ static void ocfs2_clear_inode(struct inode *inode)
>* exception here are successfully wiped inodes - their
>* metadata can now be considered to be part of the system
>* inodes from which it came. */
> - if (!(OCFS2_I(inode)->ip_flags & OCFS2_INODE_DELETED))
> + if (!(oi->ip_flags & OCFS2_INODE_DELETED))
>   ocfs2_checkpoint_inode(inode);
> 
> mlog_bug_on_msg(!list_empty(&oi->ip_io_markers),
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index 3b0a10d..e1208d1 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -524,7 +524,7 @@ static int __ocfs2_mknod_locked(struct inode *dir,
>* these are used by the support functions here and in
>* callers. */
>   inode->i_ino = ino_from_blkno(osb->sb, fe_blkno);
> - OCFS2_I(inode)->ip_blkno = fe_blkno;
> + oi->ip_blkno = fe_blkno;
>   spin_lock(&osb->osb_lock);
>   inode->i_generation = osb->s_next_generation++;
>   spin_unlock(&osb->osb_lock);
> @@ -1185,8 +1185,8 @@ static int ocfs2_double_lock(struct ocfs2_super *osb,
>   }
> 
>   trace_ocfs2_double_lock_end(
> - (unsigned long long)OCFS2_I(inode1)->ip_blkno,
> - (unsigned long long)OCFS2_I(inode2)->ip_blkno);
> + 

[Ocfs2-devel] [PATCH 2/2] ocfs2: use 'osb' instead of 'OCFS2_SB()'

2018-01-29 Thread piaojun
We could use 'osb' instead of 'OCFS2_SB()' to make code more elegant.

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/aops.c |  2 +-
 fs/ocfs2/dir.c  |  2 +-
 fs/ocfs2/dlmglue.c  | 21 -
 fs/ocfs2/file.c |  2 +-
 fs/ocfs2/inode.c|  6 +++---
 fs/ocfs2/refcounttree.c |  4 ++--
 fs/ocfs2/suballoc.c |  4 ++--
 fs/ocfs2/super.c|  4 ++--
 fs/ocfs2/xattr.c|  2 +-
 9 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 4dae836..e9c2360 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2211,7 +2211,7 @@ static int ocfs2_dio_wr_get_block(struct inode *inode, 
sector_t iblock,
down_write(&oi->ip_alloc_sem);

if (first_get_block) {
-   if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
+   if (ocfs2_sparse_alloc(osb))
ret = ocfs2_zero_tail(inode, di_bh, pos);
else
ret = ocfs2_expand_nonsparse_inode(inode, di_bh, pos,
diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe631..0a38408 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -3071,7 +3071,7 @@ static int ocfs2_expand_inline_dir(struct inode *dir, 
struct buffer_head *di_bh,
 * We need to return the correct block within the
 * cluster which should hold our entry.
 */
-   off = ocfs2_dx_dir_hash_idx(OCFS2_SB(dir->i_sb),
+   off = ocfs2_dx_dir_hash_idx(osb,
    &lookup->dl_hinfo);
get_bh(dx_leaves[off]);
lookup->dl_dx_leaf_bh = dx_leaves[off];
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 4689940..9f937a2 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1734,8 +1734,7 @@ int ocfs2_rw_lock(struct inode *inode, int write)

level = write ? DLM_LOCK_EX : DLM_LOCK_PR;

-   status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres, level, 0,
-   0);
+   status = ocfs2_cluster_lock(osb, lockres, level, 0, 0);
if (status < 0)
mlog_errno(status);

@@ -1753,7 +1752,7 @@ void ocfs2_rw_unlock(struct inode *inode, int write)
 write ? "EXMODE" : "PRMODE");

if (!ocfs2_mount_local(osb))
-   ocfs2_cluster_unlock(OCFS2_SB(inode->i_sb), lockres, level);
+   ocfs2_cluster_unlock(osb, lockres, level);
 }

 /*
@@ -1773,8 +1772,7 @@ int ocfs2_open_lock(struct inode *inode)

lockres = &OCFS2_I(inode)->ip_open_lockres;

-   status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres,
-   DLM_LOCK_PR, 0, 0);
+   status = ocfs2_cluster_lock(osb, lockres, DLM_LOCK_PR, 0, 0);
if (status < 0)
mlog_errno(status);

@@ -1811,8 +1809,7 @@ int ocfs2_try_open_lock(struct inode *inode, int write)
 * other nodes and the -EAGAIN will indicate to the caller that
 * this inode is still in use.
 */
-   status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres,
-   level, DLM_LKF_NOQUEUE, 0);
+   status = ocfs2_cluster_lock(osb, lockres, level, DLM_LKF_NOQUEUE, 0);

 out:
return status;
@@ -1833,11 +1830,9 @@ void ocfs2_open_unlock(struct inode *inode)
goto out;

if(lockres->l_ro_holders)
-   ocfs2_cluster_unlock(OCFS2_SB(inode->i_sb), lockres,
-DLM_LOCK_PR);
+   ocfs2_cluster_unlock(osb, lockres, DLM_LOCK_PR);
if(lockres->l_ex_holders)
-   ocfs2_cluster_unlock(OCFS2_SB(inode->i_sb), lockres,
-DLM_LOCK_EX);
+   ocfs2_cluster_unlock(osb, lockres, DLM_LOCK_EX);

 out:
return;
@@ -2539,9 +2534,9 @@ void ocfs2_inode_unlock(struct inode *inode,
 (unsigned long long)OCFS2_I(inode)->ip_blkno,
 ex ? "EXMODE" : "PRMODE");

-   if (!ocfs2_is_hard_readonly(OCFS2_SB(inode->i_sb)) &&
+   if (!ocfs2_is_hard_readonly(osb) &&
!ocfs2_mount_local(osb))
-   ocfs2_cluster_unlock(OCFS2_SB(inode->i_sb), lockres, level);
+   ocfs2_cluster_unlock(osb, lockres, level);
 }

 /*
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2188af4..9c9388f 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -294,7 +294,7 @@ int ocfs2_update_inode_atime(struct inode *inode,
ocfs2_journal_dirty(handle, bh);

 out_commit:
-   ocfs2_commit_trans(OCFS2_SB(inode->i_sb), handle);
+   ocfs2_commit_trans(osb, handle);
 out:
return ret;
 }
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 2c48395..3cfbce0 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1134,7 +1134,7 @@ static void ocfs2_clear_inode(struct inode *inode)

[Ocfs2-devel] [PATCH 1/2] ocfs2: use 'oi' instead of 'OCFS2_I()'

2018-01-29 Thread piaojun
We could use 'oi' instead of 'OCFS2_I()' to make code more elegant.

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/alloc.c| 2 +-
 fs/ocfs2/aops.c | 2 +-
 fs/ocfs2/file.c | 6 +++---
 fs/ocfs2/inode.c| 2 +-
 fs/ocfs2/namei.c| 6 +++---
 fs/ocfs2/refcounttree.c | 6 +++---
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index ab5105f..8ee4bd8 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6940,7 +6940,7 @@ int ocfs2_convert_inline_data_to_extents(struct inode 
*inode,
goto out_commit;
did_quota = 1;

-   data_ac->ac_resv = &OCFS2_I(inode)->ip_la_data_resv;
+   data_ac->ac_resv = &oi->ip_la_data_resv;

ret = ocfs2_claim_clusters(handle, data_ac, 1, &bit_off,
   &num);
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index d151632..4dae836 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -346,7 +346,7 @@ static int ocfs2_readpage(struct file *file, struct page 
*page)
unlock = 0;

 out_alloc:
-   up_read(&OCFS2_I(inode)->ip_alloc_sem);
+   up_read(&oi->ip_alloc_sem);
 out_inode_unlock:
ocfs2_inode_unlock(inode, 0);
 out:
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index dc455d4..2188af4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -101,7 +101,7 @@ static int ocfs2_file_open(struct inode *inode, struct file 
*file)
struct ocfs2_inode_info *oi = OCFS2_I(inode);

trace_ocfs2_file_open(inode, file, file->f_path.dentry,
- (unsigned long long)OCFS2_I(inode)->ip_blkno,
+ (unsigned long long)oi->ip_blkno,
  file->f_path.dentry->d_name.len,
  file->f_path.dentry->d_name.name, mode);

@@ -116,7 +116,7 @@ static int ocfs2_file_open(struct inode *inode, struct file 
*file)
/* Check that the inode hasn't been wiped from disk by another
 * node. If it hasn't then we're safe as long as we hold the
 * spin lock until our increment of open count. */
-   if (OCFS2_I(inode)->ip_flags & OCFS2_INODE_DELETED) {
+   if (oi->ip_flags & OCFS2_INODE_DELETED) {
spin_unlock(&oi->ip_lock);

status = -ENOENT;
@@ -188,7 +188,7 @@ static int ocfs2_sync_file(struct file *file, loff_t start, 
loff_t end,
bool needs_barrier = false;

trace_ocfs2_sync_file(inode, file, file->f_path.dentry,
- OCFS2_I(inode)->ip_blkno,
+ oi->ip_blkno,
  file->f_path.dentry->d_name.len,
  file->f_path.dentry->d_name.name,
  (unsigned long long)datasync);
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 1a1e007..2c48395 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1159,7 +1159,7 @@ static void ocfs2_clear_inode(struct inode *inode)
 * exception here are successfully wiped inodes - their
 * metadata can now be considered to be part of the system
 * inodes from which it came. */
-   if (!(OCFS2_I(inode)->ip_flags & OCFS2_INODE_DELETED))
+   if (!(oi->ip_flags & OCFS2_INODE_DELETED))
ocfs2_checkpoint_inode(inode);

mlog_bug_on_msg(!list_empty(&oi->ip_io_markers),
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d..e1208d1 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -524,7 +524,7 @@ static int __ocfs2_mknod_locked(struct inode *dir,
 * these are used by the support functions here and in
 * callers. */
inode->i_ino = ino_from_blkno(osb->sb, fe_blkno);
-   OCFS2_I(inode)->ip_blkno = fe_blkno;
+   oi->ip_blkno = fe_blkno;
spin_lock(&osb->osb_lock);
inode->i_generation = osb->s_next_generation++;
spin_unlock(&osb->osb_lock);
@@ -1185,8 +1185,8 @@ static int ocfs2_double_lock(struct ocfs2_super *osb,
}

trace_ocfs2_double_lock_end(
-   (unsigned long long)OCFS2_I(inode1)->ip_blkno,
-   (unsigned long long)OCFS2_I(inode2)->ip_blkno);
+   (unsigned long long)oi1->ip_blkno,
+   (unsigned long long)oi2->ip_blkno);

 bail:
if (status)
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index ab156e3..50e288e 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -573,7 +573,7 @@ static int ocfs2_create_refcount_tree(struct inode *inode,
BUG_ON(ocfs2_is_refcount_inode(inode));

trace_ocfs2_create_refcount_tree(
-   (unsigned long long)OCFS2_I(inode)->ip_blkno);
+   (unsigned long long)oi->ip_blkno);

ret = ocfs2_reserve_new_metadata_blocks(osb, 1, &meta_ac);
if (ret) {
@@ -4766,8 +4766,8 @@ static int ocfs2_reflink_inodes_lock(struct inode 

[Ocfs2-devel] [PATCH v3] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-28 Thread piaojun
We should not reuse the dirty bh in jbd2 directly due to the following
situation:

1. When removing extent rec, we will dirty the bhs of extent rec and
   truncate log at the same time, and hand them over to jbd2.
2. The bhs are submitted to jbd2 area successfully.
3. The write-back thread of the device helps flush the bhs to disk but
   encounters a write error due to an abnormal storage link.
4. After a while the storage link becomes normal. The truncate log flush
   worker triggered by the next space reclaim finds the dirty bh of the
   truncate log, clears its 'BH_Write_EIO' and then sets it uptodate in
   __ocfs2_journal_access():

ocfs2_truncate_log_worker
  ocfs2_flush_truncate_log
__ocfs2_flush_truncate_log
  ocfs2_replay_truncate_records
ocfs2_journal_access_di
  __ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
uptodate.

5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
   extent rec is still in error state, and unfortunately nobody will
   take care of it.
6. At last the space of the extent rec was not reduced, but the truncate log
   flush worker has given it back to globalalloc. That will cause a
   duplicate cluster problem which can be identified by fsck.ocfs2.

Sadly we can hardly revert this, so we set the fs read-only to avoid
ruining the atomicity and consistency of space reclaim.

Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
__ocfs2_journal_access")

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/journal.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 3630443..e5dcea6 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -666,23 +666,24 @@ static int __ocfs2_journal_access(handle_t *handle,
/* we can safely remove this assertion after testing. */
if (!buffer_uptodate(bh)) {
mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
-   mlog(ML_ERROR, "b_blocknr=%llu\n",
-(unsigned long long)bh->b_blocknr);
+   mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
+(unsigned long long)bh->b_blocknr, bh->b_state);

lock_buffer(bh);
/*
-* A previous attempt to write this buffer head failed.
-* Nothing we can do but to retry the write and hope for
-* the best.
+* A previous transaction with a couple of buffer heads failed
+* to checkpoint, so all those bhs are marked as BH_Write_EIO.
+* In the current transaction, this bh is just one of the error
+* bhs that the previous transaction handled. We can't simply
+* clear its BH_Write_EIO and reuse it directly, since the
+* other bhs have not been written to disk yet and that would
+* cause metadata inconsistency. So we should set the fs
+* read-only to avoid further damage.
 */
if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
-   clear_buffer_write_io_error(bh);
-   set_buffer_uptodate(bh);
-   }
-
-   if (!buffer_uptodate(bh)) {
unlock_buffer(bh);
-   return -EIO;
+   return ocfs2_error(osb->sb, "A previous attempt to "
+   "write this buffer head failed\n");
}
unlock_buffer(bh);
}
-- 



Re: [Ocfs2-devel] [PATCH v2] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-28 Thread piaojun
Hi Changwei,

On 2018/1/27 19:19, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/27 16:28, piaojun wrote:
>> We should not reuse the dirty bh in jbd2 directly due to the following
>> situation:
>>
>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>> truncate log at the same time, and hand them over to jbd2.
>> 2. The bhs are submitted to jbd2 area successfully.
>> 3. The write-back thread of device help flush the bhs to disk but
>> encounter write error due to abnormal storage link.
>> 4. After a while the storage link become normal. Truncate log flush
>> worker triggered by the next space reclaiming found the dirty bh of
>> truncate log and clear its 'BH_Write_EIO' and then set it uptodate in
>> __ocfs2_journal_access():
>>
>> ocfs2_truncate_log_worker
>>ocfs2_flush_truncate_log
>>  __ocfs2_flush_truncate_log
>>ocfs2_replay_truncate_records
>>  ocfs2_journal_access_di
>>__ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
>> uptodata.
>>
>> 5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>> extent rec is still in error state, and unfortunately nobody will
>> take care of it.
>> 6. At last the space of extent rec was not reduced, but truncate log
>> flush worker have given it back to globalalloc. That will cause
>> duplicate cluster problem which could be identified by fsck.ocfs2.
>>
>> Sadlly we can hardly revert this but set fs read-only in case of
>> ruining atomicity and consistency of space reclaim.
>>
>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
>> __ocfs2_journal_access")
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>> ---
>>   fs/ocfs2/journal.c | 49 ++---
>>   1 file changed, 38 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index 3630443..4c5661c 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -666,23 +666,50 @@ static int __ocfs2_journal_access(handle_t *handle,
>>  /* we can safely remove this assertion after testing. */
>>  if (!buffer_uptodate(bh)) {
>>  mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>> -mlog(ML_ERROR, "b_blocknr=%llu\n",
>> - (unsigned long long)bh->b_blocknr);
>> +mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>> + (unsigned long long)bh->b_blocknr, bh->b_state);
>>
>>  lock_buffer(bh);
>>  /*
>> - * A previous attempt to write this buffer head failed.
>> - * Nothing we can do but to retry the write and hope for
>> - * the best.
>> + * We should not reuse the dirty bh directly due to the
>> + * following situation:
>> + *
>> + * 1. When removing extent rec, we will dirty the bhs of
>> + *extent rec and truncate log at the same time, and
>> + *hand them over to jbd2.
>> + * 2. The bhs are submitted to jbd2 area successfully.
>> + * 3. The write-back thread of device help flush the bhs
>> + *to disk but encounter write error due to abnormal
>> + *storage link.
>> + * 4. After a while the storage link become normal.
>> + *Truncate log flush worker triggered by the next
>> + *space reclaiming found the dirty bh of truncate log
>> + *and clear its 'BH_Write_EIO' and then set it uptodate
>> + *in __ocfs2_journal_access():
>> + *
>> + *ocfs2_truncate_log_worker
>> + *  ocfs2_flush_truncate_log
>> + *__ocfs2_flush_truncate_log
>> + *  ocfs2_replay_truncate_records
>> + *ocfs2_journal_access_di
>> + *  __ocfs2_journal_access
>> + *
>> + * 5. Then jbd2 will flush the bh of truncate log to disk,
>> + *but the bh of extent rec is still in error state, and
>> + *unfortunately nobody will take care of it.
>> + * 6. At last the space of extent rec was not reduced,
>> + *but truncate log flush worker have given it back to

[Ocfs2-devel] [PATCH v2] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-27 Thread piaojun
We should not reuse a dirty bh in jbd2 directly, because of the
following situation:

1. When removing an extent rec, we dirty the bhs of the extent rec and
   the truncate log at the same time, and hand them over to jbd2.
2. The bhs are submitted to the jbd2 area successfully.
3. The device's write-back thread flushes the bhs to disk but
   encounters a write error due to an abnormal storage link.
4. After a while the storage link becomes normal again. The truncate
   log flush worker, triggered by the next space reclaim, finds the
   dirty bh of the truncate log, clears its 'BH_Write_EIO' and then
   sets it uptodate in __ocfs2_journal_access():

ocfs2_truncate_log_worker
  ocfs2_flush_truncate_log
    __ocfs2_flush_truncate_log
      ocfs2_replay_truncate_records
        ocfs2_journal_access_di
          __ocfs2_journal_access // here we clear io_error and
                                 // set 'tl_bh' uptodate

5. Then jbd2 flushes the bh of the truncate log to disk, but the bh of
   the extent rec is still in the error state, and unfortunately nobody
   will take care of it.
6. In the end the space of the extent rec was not reduced, but the
   truncate log flush worker has already given it back to globalalloc.
   That causes a duplicate cluster problem which can be identified by
   fsck.ocfs2.

Sadly we can hardly revert this, so we set the fs read-only to avoid
ruining the atomicity and consistency of space reclaim.

Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
__ocfs2_journal_access")

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/journal.c | 49 ++---
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 3630443..4c5661c 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -666,23 +666,50 @@ static int __ocfs2_journal_access(handle_t *handle,
/* we can safely remove this assertion after testing. */
if (!buffer_uptodate(bh)) {
mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
-   mlog(ML_ERROR, "b_blocknr=%llu\n",
-(unsigned long long)bh->b_blocknr);
+   mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
+(unsigned long long)bh->b_blocknr, bh->b_state);

lock_buffer(bh);
/*
-* A previous attempt to write this buffer head failed.
-* Nothing we can do but to retry the write and hope for
-* the best.
+* We should not reuse the dirty bh directly due to the
+* following situation:
+*
+* 1. When removing extent rec, we will dirty the bhs of
+*extent rec and truncate log at the same time, and
+*hand them over to jbd2.
+* 2. The bhs are submitted to jbd2 area successfully.
+* 3. The device's write-back thread flushes the bhs
+*to disk but encounters a write error due to an
+*abnormal storage link.
+* 4. After a while the storage link becomes normal.
+*The truncate log flush worker triggered by the next
+*space reclaim finds the dirty bh of the truncate log,
+*clears its 'BH_Write_EIO' and then sets it uptodate
+*in __ocfs2_journal_access():
+*
+*ocfs2_truncate_log_worker
+*  ocfs2_flush_truncate_log
+*__ocfs2_flush_truncate_log
+*  ocfs2_replay_truncate_records
+*ocfs2_journal_access_di
+*  __ocfs2_journal_access
+*
+* 5. Then jbd2 will flush the bh of truncate log to disk,
+*but the bh of extent rec is still in error state, and
+*unfortunately nobody will take care of it.
+* 6. At last the space of the extent rec was not reduced,
+*but the truncate log flush worker has given it back to
+*globalalloc. That will cause a duplicate cluster problem
+*which can be identified by fsck.ocfs2.
+*
+* Sadly we can hardly revert this, so we set the fs
+* read-only to avoid ruining the atomicity and consistency
+* of space reclaim.
 */
if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
-   clear_buffer_write_io_error(bh);
-   set_buffer_uptodate(bh);
-   }
-
-   if (!buffer_uptodate(bh)) {
unlock_buffer(bh);
-   return -EIO;
+   return ocfs2_error(osb->sb, "A previous attempt to "
+   "write this buffer head failed\n");
   

Re: [Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-26 Thread piaojun
Hi Changwei,

Thanks for the quick reply. Gang and I are thinking about how to notify
the user so that this problem can be recovered, and I will post patch v2
later.

thanks
Jun

On 2018/1/27 13:17, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/27 11:52, piaojun wrote:
>> Hi Jan and Changwei,
>>
>> I will describle the scenario again as below:
>>
>> 1. The bhs of truncate log and extent rec are submitted to jbd2 area
>> successfully.
>> 2. Then write-back thread of device help flush the bhs to disk but
>> encounter write error due to abnormal storage link.
> 
> Now, your problem description makes sense.
> It seems you have withdrawn your last version of patch from -mm tree.
> I will look at your next version.
> 
> Thanks,
> Changwei
> 
>> 3. After a while the storage link become normal. Truncate log flush
>> worker triggered by the next space reclaiming found the dirty bh of
>> truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>> in __ocfs2_journal_access().
>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>> extent rec is still in error state, and unfortunately nobody will
>> take care of it.
>> 5. At last the space of extent rec was not reduced, but truncate log
>> flush worker have given it back to globalalloc. That will cause
>> duplicate cluster problem which could be identified by fsck.ocfs2.
>>
>> I suggest ocfs2 should handle this problem.
>>
>> thanks,
>> Jun
>>
>> On 2018/1/26 10:03, Changwei Ge wrote:
>>> On 2018/1/26 9:45, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2018/1/26 9:00, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>> Good morning:-)
>>>>>
>>>>> On 2018/1/25 20:45, piaojun wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>>>>> Hi Jun,
>>>>>>>
>>>>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>>>>> Hi Changwei,
>>>>>>>>
>>>>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>>>>> Hi Jun,
>>>>>>>>>
>>>>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the 
>>>>>>>>>> following
>>>>>>>>>> situation:
>>>>>>>>>>
>>>>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>>>>> Quick questions:
>>>>>>>>> Do you mean current code puts modifying extent records belonging to a 
>>>>>>>>> certain file and modifying truncate log into the same transaction?
>>>>>>>>> If so, forgive me since I didn't figure it out. Could you point out 
>>>>>>>>> in your following sequence diagram?
>>>>>>>>>
>>>>>>>>> Afterwards, I can understand the issue your change log is describing 
>>>>>>>>> better.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Changwei
>>>>>>>>>
>>>>>>>> Yes, I mean they are in the same transaction as below:
>>>>>>>>
>>>>>>>> ocfs2_remove_btree_range
>>>>>>>>   ocfs2_remove_extent // modify extent records
>>>>>>>> ocfs2_truncate_log_append // modify truncate log
>>>>>>>
>>>>>>> If so I think the transaction including operations on extent and 
>>>>>>> truncate log won't be committed.
>>>>>>> And journal should already be aborted if interval transaction commit 
>>>>>>> thread has been woken.
>>>>>>> So no metadata will be changed.
>>>>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on 
>>>>>>> truncate log.
>>>>>>> Are you sure this is the root cause of your problem?
>>>>>>> I feel a little strange for it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Changwei
>>>>>>>
>>>>>> As you said, the transaction was not committed, but after a while the
>>>>>> bh of truncate log was commit

Re: [Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-26 Thread piaojun
Hi Jan and Changwei,

I will describe the scenario again as below:

1. The bhs of the truncate log and the extent rec are submitted to the
   jbd2 area successfully.
2. Then the device's write-back thread flushes the bhs to disk but
   encounters a write error due to an abnormal storage link.
3. After a while the storage link becomes normal. The truncate log flush
   worker triggered by the next space reclaim finds the dirty bh of the
   truncate log, clears its 'BH_Write_EIO' and then sets it uptodate
   in __ocfs2_journal_access().
4. Then jbd2 flushes the bh of the truncate log to disk, but the bh of
   the extent rec is still in the error state, and unfortunately nobody
   will take care of it.
5. In the end the space of the extent rec was not reduced, but the
   truncate log flush worker has given it back to globalalloc. That
   causes a duplicate cluster problem which can be identified by
   fsck.ocfs2.

I suggest ocfs2 should handle this problem.

thanks,
Jun

On 2018/1/26 10:03, Changwei Ge wrote:
> On 2018/1/26 9:45, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/26 9:00, Changwei Ge wrote:
>>> Hi Jun,
>>> Good morning:-)
>>>
>>> On 2018/1/25 20:45, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>>> Hi Jun,
>>>>>>>
>>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>>>> situation:
>>>>>>>>
>>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>>> Quick questions:
>>>>>>> Do you mean current code puts modifying extent records belonging to a 
>>>>>>> certain file and modifying truncate log into the same transaction?
>>>>>>> If so, forgive me since I didn't figure it out. Could you point out in 
>>>>>>> your following sequence diagram?
>>>>>>>
>>>>>>> Afterwards, I can understand the issue your change log is describing 
>>>>>>> better.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Changwei
>>>>>>>
>>>>>> Yes, I mean they are in the same transaction as below:
>>>>>>
>>>>>> ocfs2_remove_btree_range
>>>>>>  ocfs2_remove_extent // modify extent records
>>>>>>ocfs2_truncate_log_append // modify truncate log
>>>>>
>>>>> If so I think the transaction including operations on extent and truncate 
>>>>> log won't be committed.
>>>>> And journal should already be aborted if interval transaction commit 
>>>>> thread has been woken.
>>>>> So no metadata will be changed.
>>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate 
>>>>> log.
>>>>> Are you sure this is the root cause of your problem?
>>>>> I feel a little strange for it.
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>> As you said, the transaction was not committed, but after a while the
>>>> bh of truncate log was committed in another transaction. I'm sure for
>>>> the cause and after applying this patch, the duplicate cluster problem
>>>> is gone. I have tested it a few month.
>>>
>>> I think we are talking about two jbd2/transactions. right?
>> yes, two transactions involved.
>>
>>> One is for moving clusters from extent to truncate log. Let's name it T1.
>>> Anther is for declaiming clusters from truncate log and returning them back 
>>> to global bitmap. Let's name it T2.
>>>
>>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will 
>>> be aborted which means it can't work any more.
>>> All following starting transaction and commit transaction will fail.
>>>
>>> So, how can the T2 be committed while T1 fails?
>>  From my testing jbd2 won't be aborted when encounter IO error, and I
>> print the bh->b_state = 0x44828 = 100010010101000. That means the
>> bh has been submitted but write IO, and still in jbd2 according to
>> 'bh_state_bits' and 'jbd_state_bits'.
> 
> Um... Str

Re: [Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-25 Thread piaojun
Hi Gang,

On 2018/1/26 10:16, Gang He wrote:
> Hi Jun,
> 
> 

>> Hi Gang,
>>
>> Filesystem won't become readonly and journal remains normal, 
> How does the user aware this kind of issue happen?  watch the kernel message? 
> but some users do not like watch the kernel message except 
> there is serious problem (e.g. crash).  
Sadly, the user can only identify this problem from the kernel messages.

> so this problem needs user umount and mount again to recover. 
> If the user skip this error prints in the kernel message, there will be 
> further damage to the file system?
> If yes, we should let the user know this problem more obviously.
> And I'm thinking
This error won't cause further damage, but it will make the operations
that access the truncate log bh fail, such as 'rm' and 'truncate'.
Before the patch (acf8fdbe6afb), we triggered a BUG() here, but I think
that is a little rude. Perhaps we could set the fs read-only or abort
jbd2 instead.

thanks,
Jun

>> about if we should set readonly to notice user.
>>
>> thanks,
>> Jun
>>
>> On 2018/1/25 16:40, Gang He wrote:
>>> Hi Jun,
>>>
>>> If we return -EIO here, what is the consequence?
>>> make the journal aborted and file system will become read-only?
>>>
>>> Thanks
>>> Gang
>>>
>>>
>>
 We should not reuse the dirty bh in jbd2 directly due to the following
 situation:

 1. When removing extent rec, we will dirty the bhs of extent rec and
truncate log at the same time, and hand them over to jbd2.
 2. The bhs are not flushed to disk due to abnormal storage link.
 3. After a while the storage link become normal. Truncate log flush
worker triggered by the next space reclaiming found the dirty bh of
truncate log and clear its 'BH_Write_EIO' and then set it uptodate
in __ocfs2_journal_access():

 ocfs2_truncate_log_worker
   ocfs2_flush_truncate_log
 __ocfs2_flush_truncate_log
   ocfs2_replay_truncate_records
 ocfs2_journal_access_di
   __ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
 uptodata.

 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
extent rec is still in error state, and unfortunately nobody will
take care of it.
 5. At last the space of extent rec was not reduced, but truncate log
flush worker have given it back to globalalloc. That will cause
duplicate cluster problem which could be identified by fsck.ocfs2.

 So we should return -EIO in case of ruining atomicity and consistency
 of space reclaim.

 Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
 __ocfs2_journal_access")

 Signed-off-by: Jun Piao 
 Reviewed-by: Yiwen Jiang 
 ---
  fs/ocfs2/journal.c | 45 +++--
  1 file changed, 35 insertions(+), 10 deletions(-)

 diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
 index 3630443..d769ca2 100644
 --- a/fs/ocfs2/journal.c
 +++ b/fs/ocfs2/journal.c
 @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
/* we can safely remove this assertion after testing. */
if (!buffer_uptodate(bh)) {
mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
 -  mlog(ML_ERROR, "b_blocknr=%llu\n",
 -   (unsigned long long)bh->b_blocknr);
 +  mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
 +   (unsigned long long)bh->b_blocknr, bh->b_state);

lock_buffer(bh);
/*
 -   * A previous attempt to write this buffer head failed.
 -   * Nothing we can do but to retry the write and hope for
 -   * the best.
 +   * We should not reuse the dirty bh directly due to the
 +   * following situation:
 +   *
 +   * 1. When removing extent rec, we will dirty the bhs of
 +   *extent rec and truncate log at the same time, and
 +   *hand them over to jbd2.
 +   * 2. The bhs are not flushed to disk due to abnormal
 +   *storage link.
 +   * 3. After a while the storage link become normal.
 +   *Truncate log flush worker triggered by the next
 +   *space reclaiming found the dirty bh of truncate log
 +   *and clear its 'BH_Write_EIO' and then set it uptodate
 +   *in __ocfs2_journal_access():
 +   *
 +   *ocfs2_truncate_log_worker
 +   *  ocfs2_flush_truncate_log
 +   *__ocfs2_flush_truncate_log
 +   *  ocfs2_replay_truncate_records
 +   *ocfs2_journal_access_di
 +   *  __ocfs2_journal_access
 +   *
 +   * 4. Then jbd2 will flush the bh of 

Re: [Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-25 Thread piaojun
Hi Changwei,

On 2018/1/26 9:00, Changwei Ge wrote:
> Hi Jun,
> Good morning:-)
> 
> On 2018/1/25 20:45, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/25 20:30, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/1/25 20:18, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>> situation:
>>>>>>
>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>> Quick questions:
>>>>> Do you mean current code puts modifying extent records belonging to a 
>>>>> certain file and modifying truncate log into the same transaction?
>>>>> If so, forgive me since I didn't figure it out. Could you point out in 
>>>>> your following sequence diagram?
>>>>>
>>>>> Afterwards, I can understand the issue your change log is describing 
>>>>> better.
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>> Yes, I mean they are in the same transaction as below:
>>>>
>>>> ocfs2_remove_btree_range
>>>> ocfs2_remove_extent // modify extent records
>>>>   ocfs2_truncate_log_append // modify truncate log
>>>
>>> If so I think the transaction including operations on extent and truncate 
>>> log won't be committed.
>>> And journal should already be aborted if interval transaction commit thread 
>>> has been woken.
>>> So no metadata will be changed.
>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate 
>>> log.
>>> Are you sure this is the root cause of your problem?
>>> I feel a little strange for it.
>>>
>>> Thanks,
>>> Changwei
>>>
>> As you said, the transaction was not committed, but after a while the
>> bh of truncate log was committed in another transaction. I'm sure for
>> the cause and after applying this patch, the duplicate cluster problem
>> is gone. I have tested it a few month.
> 
> I think we are talking about two jbd2/transactions. right?
yes, two transactions involved.

> One is for moving clusters from extent to truncate log. Let's name it T1.
> Anther is for declaiming clusters from truncate log and returning them back 
> to global bitmap. Let's name it T2.
> 
> If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will be 
> aborted which means it can't work any more.
> All following starting transaction and commit transaction will fail.
> 
> So, how can the T2 be committed while T1 fails?
From my testing, jbd2 won't be aborted when it encounters an IO error,
and I printed bh->b_state = 0x44828 (binary 100 0100 1000 0010 1000).
That means the bh has been submitted but hit a write IO error, and it is
still in jbd2, according to 'bh_state_bits' and 'jbd_state_bits'.
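
As a quick aid for decoding such values (this is not part of any patch,
just a stand-alone sketch), a tiny user-space helper along the lines
below lists which bit positions are set in a b_state word; mapping each
position back to a flag name still has to be done against the enum
bh_state_bits and enum jbd_state_bits definitions of the exact kernel
under test:

#include <stdio.h>
#include <stdlib.h>

/*
 * Print the bit positions that are set in a buffer_head b_state word.
 * The meaning of each position comes from enum bh_state_bits
 * (include/linux/buffer_head.h) and enum jbd_state_bits
 * (include/linux/jbd2.h) of the kernel under test.
 */
int main(int argc, char **argv)
{
	unsigned long state = argc > 1 ? strtoul(argv[1], NULL, 0) : 0x44828;
	unsigned int bit;

	printf("b_state = 0x%lx, set bits:", state);
	for (bit = 0; bit < 8 * sizeof(state); bit++)
		if (state & (1UL << bit))
			printf(" %u", bit);
	printf("\n");
	return 0;
}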

>
> Otherwise, did you ever try to recover jbd2/journal? If so, I think your 
> patch here is not fit for mainline yet.
> 
Currently this problem requires the user to umount and mount again to
recover, and I'm glad to hear your advice.

thanks,
Jun

> Thanks,
> Changwei
> 

>>
>> thanks,
>> Jun
>>
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>>>>   truncate log at the same time, and hand them over to jbd2.
>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>>>   worker triggered by the next space reclaiming found the dirty bh of
>>>>>>   truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>   in __ocfs2_journal_access():
>>>>>>
>>>>>> ocfs2_truncate_log_worker
>>>>>>  ocfs2_flush_truncate_log
>>>>>>__ocfs2_flush_truncate_log
>>>>>>  ocfs2_replay_truncate_records
>>>>>>ocfs2_journal_access_di
>>>>>>  __ocfs2_journal_access // here we clear io_error and set 
>>>>>> 'tl_bh' uptodata.
>>>>>>
>>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>>>   extent rec is still in error state, and unfortunately nobody will
>>>>>>   t

Re: [Ocfs2-devel] [PATCH] ocfs2: unlock bh_state if bg check fails

2018-01-25 Thread piaojun
LGTM

On 2018/1/25 9:18, Changwei Ge wrote:
> We should unlock bh_stat if bg->bg_free_bits_count > bg->bg_bits
> 
> Suggested-by: Jan Kara 
> Signed-off-by: Changwei Ge 
Reviewed-by: Jun Piao 
> ---
>  fs/ocfs2/suballoc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 71f22c8..6fee797 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -2441,6 +2441,8 @@ static int ocfs2_block_group_clear_bits(handle_t 
> *handle,
>   }
>   le16_add_cpu(&bg->bg_free_bits_count, num_bits);
>   if (le16_to_cpu(bg->bg_free_bits_count) > le16_to_cpu(bg->bg_bits)) {
> + if (undo_fn)
> + jbd_unlock_bh_state(group_bh);
>   return ocfs2_error(alloc_inode->i_sb, "Group descriptor # %llu 
> has bit count %u but claims %u are freed. num_bits %d\n",
>  (unsigned long 
> long)le64_to_cpu(bg->bg_blkno),
>  le16_to_cpu(bg->bg_bits),
> 



Re: [Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-25 Thread piaojun
Hi Changwei,

On 2018/1/25 20:30, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/25 20:18, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/25 19:59, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/1/25 10:41, piaojun wrote:
>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>> situation:
>>>>
>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>> Quick questions:
>>> Do you mean current code puts modifying extent records belonging to a 
>>> certain file and modifying truncate log into the same transaction?
>>> If so, forgive me since I didn't figure it out. Could you point out in your 
>>> following sequence diagram?
>>>
>>> Afterwards, I can understand the issue your change log is describing better.
>>>
>>> Thanks,
>>> Changwei
>>>
>> Yes, I mean they are in the same transaction as below:
>>
>> ocfs2_remove_btree_range
>>ocfs2_remove_extent // modify extent records
>>  ocfs2_truncate_log_append // modify truncate log
> 
> If so I think the transaction including operations on extent and truncate log 
> won't be committed.
> And journal should already be aborted if interval transaction commit thread 
> has been woken.
> So no metadata will be changed.
> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
> Are you sure this is the root cause of your problem?
> I feel a little strange for it.
> 
> Thanks,
> Changwei
> 
As you said, the transaction was not committed, but after a while the
bh of the truncate log was committed in another transaction. I am sure
about the cause, and after applying this patch the duplicate cluster
problem is gone. I have been testing it for a few months.

thanks,
Jun

>>
>> thanks,
>> Jun
>>
>>>>  truncate log at the same time, and hand them over to jbd2.
>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>  worker triggered by the next space reclaiming found the dirty bh of
>>>>  truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>  in __ocfs2_journal_access():
>>>>
>>>> ocfs2_truncate_log_worker
>>>> ocfs2_flush_truncate_log
>>>>   __ocfs2_flush_truncate_log
>>>> ocfs2_replay_truncate_records
>>>>   ocfs2_journal_access_di
>>>> __ocfs2_journal_access // here we clear io_error and set 
>>>> 'tl_bh' uptodata.
>>>>
>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>  extent rec is still in error state, and unfortunately nobody will
>>>>  take care of it.
>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>  flush worker have given it back to globalalloc. That will cause
>>>>  duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>
>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>> of space reclaim.
>>>>
>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
>>>> __ocfs2_journal_access")
>>>>
>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>>>> ---
>>>>fs/ocfs2/journal.c | 45 +++--
>>>>1 file changed, 35 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>> index 3630443..d769ca2 100644
>>>> --- a/fs/ocfs2/journal.c
>>>> +++ b/fs/ocfs2/journal.c
>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>/* we can safely remove this assertion after testing. */
>>>>if (!buffer_uptodate(bh)) {
>>>>mlog(ML_ERROR, "giving me a buffer that's not 
>>>> uptodate!\n");
>>>> -  mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>> -   (unsigned long long)bh->b_blocknr);
>>>> +  mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>> +   (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>
>>>>lock_buffer(bh);
>>>> 

Re: [Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-25 Thread piaojun
Hi Gang,

The filesystem won't become read-only and the journal remains normal, so
this problem requires the user to umount and mount again to recover. And
I'm thinking about whether we should set the fs read-only to notify the
user.

thanks,
Jun

On 2018/1/25 16:40, Gang He wrote:
> Hi Jun,
> 
> If we return -EIO here, what is the consequence?
> make the journal aborted and file system will become read-only?
> 
> Thanks
> Gang
> 
> 

>> We should not reuse the dirty bh in jbd2 directly due to the following
>> situation:
>>
>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>truncate log at the same time, and hand them over to jbd2.
>> 2. The bhs are not flushed to disk due to abnormal storage link.
>> 3. After a while the storage link become normal. Truncate log flush
>>worker triggered by the next space reclaiming found the dirty bh of
>>truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>in __ocfs2_journal_access():
>>
>> ocfs2_truncate_log_worker
>>   ocfs2_flush_truncate_log
>> __ocfs2_flush_truncate_log
>>   ocfs2_replay_truncate_records
>> ocfs2_journal_access_di
>>   __ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
>> uptodata.
>>
>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>extent rec is still in error state, and unfortunately nobody will
>>take care of it.
>> 5. At last the space of extent rec was not reduced, but truncate log
>>flush worker have given it back to globalalloc. That will cause
>>duplicate cluster problem which could be identified by fsck.ocfs2.
>>
>> So we should return -EIO in case of ruining atomicity and consistency
>> of space reclaim.
>>
>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
>> __ocfs2_journal_access")
>>
>> Signed-off-by: Jun Piao 
>> Reviewed-by: Yiwen Jiang 
>> ---
>>  fs/ocfs2/journal.c | 45 +++--
>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index 3630443..d769ca2 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>  /* we can safely remove this assertion after testing. */
>>  if (!buffer_uptodate(bh)) {
>>  mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>> -mlog(ML_ERROR, "b_blocknr=%llu\n",
>> - (unsigned long long)bh->b_blocknr);
>> +mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>> + (unsigned long long)bh->b_blocknr, bh->b_state);
>>
>>  lock_buffer(bh);
>>  /*
>> - * A previous attempt to write this buffer head failed.
>> - * Nothing we can do but to retry the write and hope for
>> - * the best.
>> + * We should not reuse the dirty bh directly due to the
>> + * following situation:
>> + *
>> + * 1. When removing extent rec, we will dirty the bhs of
>> + *extent rec and truncate log at the same time, and
>> + *hand them over to jbd2.
>> + * 2. The bhs are not flushed to disk due to abnormal
>> + *storage link.
>> + * 3. After a while the storage link become normal.
>> + *Truncate log flush worker triggered by the next
>> + *space reclaiming found the dirty bh of truncate log
>> + *and clear its 'BH_Write_EIO' and then set it uptodate
>> + *in __ocfs2_journal_access():
>> + *
>> + *ocfs2_truncate_log_worker
>> + *  ocfs2_flush_truncate_log
>> + *__ocfs2_flush_truncate_log
>> + *  ocfs2_replay_truncate_records
>> + *ocfs2_journal_access_di
>> + *  __ocfs2_journal_access
>> + *
>> + * 4. Then jbd2 will flush the bh of truncate log to disk,
>> + *but the bh of extent rec is still in error state, and
>> + *unfortunately nobody will take care of it.
>> + * 5. At last the space of extent rec was not reduced,
>> + *but truncate log flush worker have given it back to
>> + *globalalloc. That will cause duplicate cluster problem
>> + *which could be identified by fsck.ocfs2.
>> + *
>> + * So we should return -EIO in case of ruining atomicity
>> + * and consistency of space reclaim.
>>   */
>>  if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>> -clear_buffer_write_io_error(bh);
>> -set_buffer_uptodate(bh);
>> -}
>> -
>> -if (!buffer_uptodate(bh)) {
>> +mlog(ML_ERROR, "A 

Re: [Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-25 Thread piaojun
Hi Changwei,

On 2018/1/25 19:59, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/25 10:41, piaojun wrote:
>> We should not reuse the dirty bh in jbd2 directly due to the following
>> situation:
>>
>> 1. When removing extent rec, we will dirty the bhs of extent rec and
> Quick questions:
> Do you mean current code puts modifying extent records belonging to a certain 
> file and modifying truncate log into the same transaction?
> If so, forgive me since I didn't figure it out. Could you point out in your 
> following sequence diagram?
> 
> Afterwards, I can understand the issue your change log is describing better.
> 
> Thanks,
> Changwei
> 
Yes, I mean they are in the same transaction, as below:

ocfs2_remove_btree_range
  ocfs2_remove_extent       // modify extent records
  ocfs2_truncate_log_append // modify truncate log
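
To make the grouping concrete, here is a purely illustrative sketch
(only ocfs2_start_trans()/ocfs2_commit_trans() are real ocfs2 calls; the
two helpers are invented stand-ins and all arguments are simplified):
both buffer heads are dirtied while the same jbd2 handle is running, so
jbd2 treats them as parts of one transaction.

/* invented stand-ins, markers for the two modifications listed above */
static void sketch_modify_extent_rec(handle_t *handle);
static void sketch_append_truncate_log(handle_t *handle);

static int sketch_remove_btree_range(struct ocfs2_super *osb)
{
	/* one handle covers both metadata updates */
	handle_t *handle = ocfs2_start_trans(osb, 2);

	if (IS_ERR(handle))
		return PTR_ERR(handle);

	sketch_modify_extent_rec(handle);	/* dirties the extent rec bh */
	sketch_append_truncate_log(handle);	/* dirties the truncate log bh */

	return ocfs2_commit_trans(osb, handle);
}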

thanks,
Jun

>> truncate log at the same time, and hand them over to jbd2.
>> 2. The bhs are not flushed to disk due to abnormal storage link.
>> 3. After a while the storage link become normal. Truncate log flush
>> worker triggered by the next space reclaiming found the dirty bh of
>> truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>> in __ocfs2_journal_access():
>>
>> ocfs2_truncate_log_worker
>>ocfs2_flush_truncate_log
>>  __ocfs2_flush_truncate_log
>>ocfs2_replay_truncate_records
>>  ocfs2_journal_access_di
>>__ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
>> uptodata.
>>
>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>> extent rec is still in error state, and unfortunately nobody will
>> take care of it.
>> 5. At last the space of extent rec was not reduced, but truncate log
>> flush worker have given it back to globalalloc. That will cause
>> duplicate cluster problem which could be identified by fsck.ocfs2.
>>
>> So we should return -EIO in case of ruining atomicity and consistency
>> of space reclaim.
>>
>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
>> __ocfs2_journal_access")
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>> ---
>>   fs/ocfs2/journal.c | 45 +++--
>>   1 file changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index 3630443..d769ca2 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>  /* we can safely remove this assertion after testing. */
>>  if (!buffer_uptodate(bh)) {
>>  mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>> -mlog(ML_ERROR, "b_blocknr=%llu\n",
>> - (unsigned long long)bh->b_blocknr);
>> +mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>> + (unsigned long long)bh->b_blocknr, bh->b_state);
>>
>>  lock_buffer(bh);
>>  /*
>> - * A previous attempt to write this buffer head failed.
>> - * Nothing we can do but to retry the write and hope for
>> - * the best.
>> + * We should not reuse the dirty bh directly due to the
>> + * following situation:
>> + *
>> + * 1. When removing extent rec, we will dirty the bhs of
>> + *extent rec and truncate log at the same time, and
>> + *hand them over to jbd2.
>> + * 2. The bhs are not flushed to disk due to abnormal
>> + *storage link.
>> + * 3. After a while the storage link become normal.
>> + *Truncate log flush worker triggered by the next
>> + *space reclaiming found the dirty bh of truncate log
>> + *and clear its 'BH_Write_EIO' and then set it uptodate
>> + *in __ocfs2_journal_access():
>> + *
>> + *ocfs2_truncate_log_worker
>> + *  ocfs2_flush_truncate_log
>> + *__ocfs2_flush_truncate_log
>> + *  ocfs2_replay_truncate_records
>> + *ocfs2_journal_access_di
>> + *  __ocfs2_journal_access
>> + *
>> + * 4. Then jbd2 will flush the bh of truncate log to disk

[Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2

2018-01-24 Thread piaojun
We should not reuse a dirty bh in jbd2 directly, because of the
following situation:

1. When removing an extent rec, we dirty the bhs of the extent rec and
   the truncate log at the same time, and hand them over to jbd2.
2. The bhs are not flushed to disk due to an abnormal storage link.
3. After a while the storage link becomes normal. The truncate log flush
   worker triggered by the next space reclaim finds the dirty bh of the
   truncate log, clears its 'BH_Write_EIO' and then sets it uptodate
   in __ocfs2_journal_access():

ocfs2_truncate_log_worker
  ocfs2_flush_truncate_log
    __ocfs2_flush_truncate_log
      ocfs2_replay_truncate_records
        ocfs2_journal_access_di
          __ocfs2_journal_access // here we clear io_error and
                                 // set 'tl_bh' uptodate

4. Then jbd2 flushes the bh of the truncate log to disk, but the bh of
   the extent rec is still in the error state, and unfortunately nobody
   will take care of it.
5. In the end the space of the extent rec was not reduced, but the
   truncate log flush worker has given it back to globalalloc. That
   causes a duplicate cluster problem which can be identified by
   fsck.ocfs2.

So we should return -EIO to avoid ruining the atomicity and consistency
of space reclaim.

Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
__ocfs2_journal_access")

Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/journal.c | 45 +++--
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 3630443..d769ca2 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
/* we can safely remove this assertion after testing. */
if (!buffer_uptodate(bh)) {
mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
-   mlog(ML_ERROR, "b_blocknr=%llu\n",
-(unsigned long long)bh->b_blocknr);
+   mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
+(unsigned long long)bh->b_blocknr, bh->b_state);

lock_buffer(bh);
/*
-* A previous attempt to write this buffer head failed.
-* Nothing we can do but to retry the write and hope for
-* the best.
+* We should not reuse the dirty bh directly due to the
+* following situation:
+*
+* 1. When removing extent rec, we will dirty the bhs of
+*extent rec and truncate log at the same time, and
+*hand them over to jbd2.
+* 2. The bhs are not flushed to disk due to abnormal
+*storage link.
+* 3. After a while the storage link becomes normal.
+*The truncate log flush worker triggered by the next
+*space reclaim finds the dirty bh of the truncate log,
+*clears its 'BH_Write_EIO' and then sets it uptodate
+*in __ocfs2_journal_access():
+*
+*ocfs2_truncate_log_worker
+*  ocfs2_flush_truncate_log
+*__ocfs2_flush_truncate_log
+*  ocfs2_replay_truncate_records
+*ocfs2_journal_access_di
+*  __ocfs2_journal_access
+*
+* 4. Then jbd2 will flush the bh of truncate log to disk,
+*but the bh of extent rec is still in error state, and
+*unfortunately nobody will take care of it.
+* 5. At last the space of the extent rec was not reduced,
+*but the truncate log flush worker has given it back to
+*globalalloc. That will cause a duplicate cluster problem
+*which can be identified by fsck.ocfs2.
+*
+* So we should return -EIO to avoid ruining the atomicity
+* and consistency of space reclaim.
 */
if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
-   clear_buffer_write_io_error(bh);
-   set_buffer_uptodate(bh);
-   }
-
-   if (!buffer_uptodate(bh)) {
+   mlog(ML_ERROR, "A previous attempt to write this "
+   "buffer head failed\n");
unlock_buffer(bh);
return -EIO;
}
-- 



Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread piaojun
Hi Changwei,

On 2018/1/19 13:42, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/19 11:59, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/19 11:38, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/1/19 11:17, piaojun wrote:
>>>> Hi Jan, Eric and Changwei,
>>>>
>>>> Could we use another mutex lock to protect quota recovery? Sharing the
>>>> lock with VFS-layer probably seems a little weird.
>>>
>>> I am afraid that we can't since quota need ::s_umount and we indeed need
>>> ::s_umount to get rid of race that quota has freed structs that will be
>>> used by quota recovery in ocfs2.
>>>
>> Could you explain which 'structs' used by quota recovery? Do you mean
>> 'struct super_block'?
> I am not pointing to super_block.
> 
> Sure.
> You can refer to
> ocfs2_finish_quota_recovery
>ocfs2_recover_local_quota_file -> here, operations on quota happens
> 
> Thanks,
> Changwei
> 
I looked through the code in deactivate_locked_super(), and did not
find any *structs* that need to be protected except 'sb'. In addition I
still could not figure out why 'dqonoff_mutex' was replaced by
's_umount'. At least, the patch 5f530de63cfc did not mention that. I
would appreciate a detailed explanation.

""
ocfs2: Use s_umount for quota recovery protection

Currently we use dqonoff_mutex to serialize quota recovery protection
and turning of quotas on / off. Use s_umount semaphore instead.
""

thanks,
Jun

>>
>> thanks,
>> Jun
>>
>>>>
>>>> On 2018/1/19 9:48, Changwei Ge wrote:
>>>>> Hi Jan,
>>>>>
>>>>> On 2018/1/18 0:03, Jan Kara wrote:
>>>>>> On Wed 17-01-18 16:21:35, Jan Kara wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
>>>>>>>> On 01/12/2018 11:43 AM, Shichangkuo wrote:
>>>>>>>>> Hi all,
>>>>>>>>>   Now we are testing ocfs2 with 4.14 kernel, and we finding a 
>>>>>>>>> deadlock with umount and ocfs2 workqueue triggered by ocfs2rec 
>>>>>>>>> thread. The stack as follows:
>>>>>>>>> journal recovery work:
>>>>>>>>> [] call_rwsem_down_read_failed+0x14/0x30
>>>>>>>>> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
>>>>>>>>> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
>>>>>>>>> [] process_one_work+0x130/0x350
>>>>>>>>> [] worker_thread+0x46/0x3b0
>>>>>>>>> [] kthread+0x101/0x140
>>>>>>>>> [] ret_from_fork+0x1f/0x30
>>>>>>>>> [] 0x
>>>>>>>>>
>>>>>>>>> /bin/umount:
>>>>>>>>> [] flush_workqueue+0x104/0x3e0
>>>>>>>>> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
>>>>>>>>> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
>>>>>>>>> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
>>>>>>>>> [] generic_shutdown_super+0x6d/0x120
>>>>>>>>> [] kill_block_super+0x2d/0x60
>>>>>>>>> [] deactivate_locked_super+0x51/0x90
>>>>>>>>> [] cleanup_mnt+0x3b/0x70
>>>>>>>>> [] task_work_run+0x86/0xa0
>>>>>>>>> [] exit_to_usermode_loop+0x6d/0xa9
>>>>>>>>> [] do_syscall_64+0x11d/0x130
>>>>>>>>> [] entry_SYSCALL64_slow_path+0x25/0x25
>>>>>>>>> [] 0x
>>>>>>>>>   
>>>>>>>>> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which 
>>>>>>>>> was already locked by umount thread, then get a deadlock.
>>>>>>>>
>>>>>>>> Good catch, thanks for reporting.  Is it reproducible? Can you please 
>>>>>>>> share
>>>>>>>> the steps for reproducing this issue?
>>>>>>>>> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 
>>>>>>>>> and 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
>>>>>>>>> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
>>>>>>>>> already removed.
>>>>>>>>> Shall we add a new mutex?
>>>>>>>&

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread piaojun
Hi Changwei,

On 2018/1/19 11:38, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/19 11:17, piaojun wrote:
>> Hi Jan, Eric and Changwei,
>>
>> Could we use another mutex lock to protect quota recovery? Sharing the
>> lock with VFS-layer probably seems a little weird.
> 
> I am afraid that we can't since quota need ::s_umount and we indeed need 
> ::s_umount to get rid of race that quota has freed structs that will be 
> used by quota recovery in ocfs2.
> 
Could you explain which 'structs' are used by quota recovery? Do you
mean 'struct super_block'?

thanks,
Jun

>>
>> On 2018/1/19 9:48, Changwei Ge wrote:
>>> Hi Jan,
>>>
>>> On 2018/1/18 0:03, Jan Kara wrote:
>>>> On Wed 17-01-18 16:21:35, Jan Kara wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
>>>>>> On 01/12/2018 11:43 AM, Shichangkuo wrote:
>>>>>>> Hi all,
>>>>>>>   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
>>>>>>> with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack 
>>>>>>> as follows:
>>>>>>> journal recovery work:
>>>>>>> [] call_rwsem_down_read_failed+0x14/0x30
>>>>>>> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
>>>>>>> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
>>>>>>> [] process_one_work+0x130/0x350
>>>>>>> [] worker_thread+0x46/0x3b0
>>>>>>> [] kthread+0x101/0x140
>>>>>>> [] ret_from_fork+0x1f/0x30
>>>>>>> [] 0x
>>>>>>>
>>>>>>> /bin/umount:
>>>>>>> [] flush_workqueue+0x104/0x3e0
>>>>>>> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
>>>>>>> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
>>>>>>> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
>>>>>>> [] generic_shutdown_super+0x6d/0x120
>>>>>>> [] kill_block_super+0x2d/0x60
>>>>>>> [] deactivate_locked_super+0x51/0x90
>>>>>>> [] cleanup_mnt+0x3b/0x70
>>>>>>> [] task_work_run+0x86/0xa0
>>>>>>> [] exit_to_usermode_loop+0x6d/0xa9
>>>>>>> [] do_syscall_64+0x11d/0x130
>>>>>>> [] entry_SYSCALL64_slow_path+0x25/0x25
>>>>>>> [] 0x
>>>>>>>   
>>>>>>> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
>>>>>>> already locked by umount thread, then get a deadlock.
>>>>>>
>>>>>> Good catch, thanks for reporting.  Is it reproducible? Can you please 
>>>>>> share
>>>>>> the steps for reproducing this issue?
>>>>>>> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 
>>>>>>> and 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
>>>>>>> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
>>>>>>> already removed.
>>>>>>> Shall we add a new mutex?
>>>>>>
>>>>>> @Jan, I don't look into the code yet, could you help me understand why we
>>>>>> need to get sb->s_umount in ocfs2_finish_quota_recovery?
>>>>>> Is it because that the quota recovery process will start at umounting? or
>>>>>> some where else?
>>>>>
>>>>> I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
>>>>> is the following: We load information about all quota information that
>>>>> needs recovering (this is possibly for other nodes) in
>>>>> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
>>>>> recovery happens from the recovery thread in 
>>>>> ocfs2_finish_quota_recovery().
>>>>> We need to protect code running there from dquot_disable() calls as that
>>>>> will free structures we use for updating quota information etc. Currently
>>>>> we use sb->s_umount for that protection.
>>>>>
>>>>> The problem above apparently happens when someone calls umount before the
>>>>> recovery thread can finish quota recovery. I will think more about how to
>>>>> fix the locking so that this lock inversion does not happen...
>>>>
>>>> So could we move ocfs2_recovery_exit

Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and ocfs2 workqueue triggered by ocfs2rec thread

2018-01-18 Thread piaojun
Hi Jan, Eric and Changwei,

Could we use another mutex lock to protect quota recovery? Sharing the
lock with the VFS layer seems a little weird.

On 2018/1/19 9:48, Changwei Ge wrote:
> Hi Jan,
> 
> On 2018/1/18 0:03, Jan Kara wrote:
>> On Wed 17-01-18 16:21:35, Jan Kara wrote:
>>> Hello,
>>>
>>> On Fri 12-01-18 16:25:56, Eric Ren wrote:
 On 01/12/2018 11:43 AM, Shichangkuo wrote:
> Hi all,
>   Now we are testing ocfs2 with 4.14 kernel, and we finding a deadlock 
> with umount and ocfs2 workqueue triggered by ocfs2rec thread. The stack 
> as follows:
> journal recovery work:
> [] call_rwsem_down_read_failed+0x14/0x30
> [] ocfs2_finish_quota_recovery+0x62/0x450 [ocfs2]
> [] ocfs2_complete_recovery+0xc1/0x440 [ocfs2]
> [] process_one_work+0x130/0x350
> [] worker_thread+0x46/0x3b0
> [] kthread+0x101/0x140
> [] ret_from_fork+0x1f/0x30
> [] 0x
>
> /bin/umount:
> [] flush_workqueue+0x104/0x3e0
> [] ocfs2_truncate_log_shutdown+0x3b/0xc0 [ocfs2]
> [] ocfs2_dismount_volume+0x8c/0x3d0 [ocfs2]
> [] ocfs2_put_super+0x31/0xa0 [ocfs2]
> [] generic_shutdown_super+0x6d/0x120
> [] kill_block_super+0x2d/0x60
> [] deactivate_locked_super+0x51/0x90
> [] cleanup_mnt+0x3b/0x70
> [] task_work_run+0x86/0xa0
> [] exit_to_usermode_loop+0x6d/0xa9
> [] do_syscall_64+0x11d/0x130
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
>   
> Function ocfs2_finish_quota_recovery try to get sb->s_umount, which was 
> already locked by umount thread, then get a deadlock.

 Good catch, thanks for reporting.  Is it reproducible? Can you please share
 the steps for reproducing this issue?
> This issue was introduced by c3b004460d77bf3f980d877be539016f2df4df12 and 
> 5f530de63cfc6ca8571cbdf58af63fb166cc6517.
> I think we cannot use :: s_umount, but the mutex ::dqonoff_mutex was 
> already removed.
> Shall we add a new mutex?

 @Jan, I don't look into the code yet, could you help me understand why we
 need to get sb->s_umount in ocfs2_finish_quota_recovery?
 Is it because that the quota recovery process will start at umounting? or
 some where else?
>>>
>>> I was refreshing my memory wrt how ocfs2 quota recovery works. The problem
>>> is the following: We load information about all quota information that
>>> needs recovering (this is possibly for other nodes) in
>>> ocfs2_begin_quota_recovery() that gets called during mount. Real quota
>>> recovery happens from the recovery thread in ocfs2_finish_quota_recovery().
>>> We need to protect code running there from dquot_disable() calls as that
>>> will free structures we use for updating quota information etc. Currently
>>> we use sb->s_umount for that protection.
>>>
>>> The problem above apparently happens when someone calls umount before the
>>> recovery thread can finish quota recovery. I will think more about how to
>>> fix the locking so that this lock inversion does not happen...
>>
>> So could we move ocfs2_recovery_exit() call in ocfs2_dismount_volume() up
>> before ocfs2_disable_quotas()? It seems possible to me, I'm just not sure
>> if there are not some hidden dependencies on recovery being shut down only
>> after truncate log / local alloc. If we can do that, we could remove
>> s_umount protection from ocfs2_finish_quota_recovery() and thus resolve the
>> race.
>>
>>  Honza
> 
> Thanks for looking into this.
> I am not quite familiar with quota part.:)
> 
> Or can we move ocfs2_disable_quotas() in ocfs2_dismount_volume down 
> after ocfs2_recovery_exit() with ::invoking down_read(>s_umount) 
> eliminated?
> 
> Another way I can figure out is:
> I think we might get inspired from qsync_work_fn().
> In that function if current work is under running context of umount with 
> ::s_umount held, it just delays current work to next time.
> 
> So can we also _try lock_ in ocfs2_finish_quota_recovery() and recover 
> quota by other ocfs2 cluster member nodes or local node's next time of 
> mount?
> 
I guess we need to analyse the impact of _try lock_, for example no
other node will help recover the quota when I'm the only node in the
cluster.
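
For reference, a minimal sketch of the _try lock_ idea (this is only a
sketch, not a tested patch; the existing recovery body is elided):

/*
 * Sketch only: if umount already holds s_umount for write, skip the
 * recovery and leave the dirty quota files for another cluster node or
 * for the next mount of this node.
 */
int ocfs2_finish_quota_recovery(struct ocfs2_super *osb,
				struct ocfs2_quota_recovery *rec,
				int slot_num)
{
	struct super_block *sb = osb->sb;
	int status = 0;

	if (!down_read_trylock(&sb->s_umount))
		return -EAGAIN;	/* umount in progress, recover later */

	/* ... existing quota recovery body unchanged ... */

	up_read(&sb->s_umount);
	return status;
}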

thanks,
Jun

> Thanks,
> Changwei
> 
> 
>>
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 


Re: [Ocfs2-devel] [PATCH] ocfs2: clean dead code up in alloc.c

2018-01-16 Thread piaojun
Hi Changwei,

LGTM

On 2018/1/16 20:17, Changwei Ge wrote:
> Some stack variables are no longer used but still assigned.
> Trim them.
> 
> Signed-off-by: Changwei Ge 
Reviewed-by: Jun Piao 
> ---
>  fs/ocfs2/alloc.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> index addd7c5..edef99c 100644
> --- a/fs/ocfs2/alloc.c
> +++ b/fs/ocfs2/alloc.c
> @@ -2598,11 +2598,8 @@ static void ocfs2_unlink_subtree(handle_t *handle,
>   int i;
>   struct buffer_head *root_bh = left_path->p_node[subtree_index].bh;
>   struct ocfs2_extent_list *root_el = left_path->p_node[subtree_index].el;
> - struct ocfs2_extent_list *el;
>   struct ocfs2_extent_block *eb;
>  
> - el = path_leaf_el(left_path);
> -
>   eb = (struct ocfs2_extent_block *)right_path->p_node[subtree_index + 
> 1].bh->b_data;
>  
>   for(i = 1; i < le16_to_cpu(root_el->l_next_free_rec); i++)
> @@ -3940,7 +3937,7 @@ static void ocfs2_adjust_rightmost_records(handle_t 
> *handle,
>  struct ocfs2_path *path,
>  struct ocfs2_extent_rec *insert_rec)
>  {
> - int ret, i, next_free;
> + int i, next_free;
>   struct buffer_head *bh;
>   struct ocfs2_extent_list *el;
>   struct ocfs2_extent_rec *rec;
> @@ -3957,7 +3954,6 @@ static void ocfs2_adjust_rightmost_records(handle_t 
> *handle,
>   ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci),
>   "Owner %llu has a bad extent list\n",
>   (unsigned long 
> long)ocfs2_metadata_cache_owner(et->et_ci));
> - ret = -EIO;
>   return;
>   }
>  
> @@ -5059,7 +5055,6 @@ int ocfs2_split_extent(handle_t *handle,
>   struct buffer_head *last_eb_bh = NULL;
>   struct ocfs2_extent_rec *rec = &el->l_recs[split_index];
>   struct ocfs2_merge_ctxt ctxt;
> - struct ocfs2_extent_list *rightmost_el;
>  
>   if (le32_to_cpu(rec->e_cpos) > le32_to_cpu(split_rec->e_cpos) ||
>   ((le32_to_cpu(rec->e_cpos) + le16_to_cpu(rec->e_leaf_clusters)) <
> @@ -5095,9 +5090,7 @@ int ocfs2_split_extent(handle_t *handle,
>   }
>  
>   eb = (struct ocfs2_extent_block *) last_eb_bh->b_data;
> - rightmost_el = &eb->h_list;
> - } else
> - rightmost_el = path_root_el(path);
> + }
>  
>   if (rec->e_cpos == split_rec->e_cpos &&
>   rec->e_leaf_clusters == split_rec->e_leaf_clusters)
> 



[Ocfs2-devel] [PATCH] ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute

2018-01-16 Thread piaojun
The race between *set_acl and *get_acl can cause incomplete xattr data
to be read, as below:

processA                            processB

ocfs2_set_acl
  ocfs2_xattr_set
    __ocfs2_xattr_set_handle

                                    ocfs2_get_acl_nolock
                                      ocfs2_xattr_get_nolock

processB may get incomplete xattr data if processA hasn't finished
set_acl yet. So we should use 'ip_xattr_sem' to protect getting the
extended attribute in ocfs2_get_acl_nolock(), as other processes could
be changing it concurrently.
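
As a side note for readers, the torn read described above can be
demonstrated with a small user-space analogy (plain pthreads, no ocfs2
code, build with gcc -pthread; the 'xattr' here is just a pair of ints
standing in for the real extended attribute data). The reader that skips
the read lock can see the two fields disagree, which is exactly what
ip_xattr_sem prevents:

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t xattr_sem = PTHREAD_RWLOCK_INITIALIZER;
static volatile int acl_a, acl_b;	/* must always be equal */

static void *writer(void *unused)
{
	for (int i = 1; i <= 1000000; i++) {
		pthread_rwlock_wrlock(&xattr_sem);
		acl_a = i;		/* non-atomic two-step update */
		acl_b = i;
		pthread_rwlock_unlock(&xattr_sem);
	}
	return NULL;
}

static void *reader(void *use_lock)
{
	long locked = (long)use_lock, torn = 0;

	for (int i = 0; i < 1000000; i++) {
		if (locked)
			pthread_rwlock_rdlock(&xattr_sem);
		if (acl_a != acl_b)	/* torn (incomplete) read observed */
			torn++;
		if (locked)
			pthread_rwlock_unlock(&xattr_sem);
	}
	printf("%s reader: %ld torn reads\n",
	       locked ? "locked" : "lockless", torn);
	return NULL;
}

int main(void)
{
	pthread_t w, r_locked, r_lockless;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r_locked, NULL, reader, (void *)1L);
	pthread_create(&r_lockless, NULL, reader, (void *)0L);
	pthread_join(w, NULL);
	pthread_join(r_locked, NULL);
	pthread_join(r_lockless, NULL);
	return 0;
}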

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
---
 fs/ocfs2/acl.c   | 6 ++
 fs/ocfs2/xattr.c | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
index 40b5cc9..917fadc 100644
--- a/fs/ocfs2/acl.c
+++ b/fs/ocfs2/acl.c
@@ -311,7 +311,9 @@ struct posix_acl *ocfs2_iop_get_acl(struct inode *inode, 
int type)
if (had_lock < 0)
return ERR_PTR(had_lock);

+   down_read(&OCFS2_I(inode)->ip_xattr_sem);
acl = ocfs2_get_acl_nolock(inode, type, di_bh);
+   up_read(&OCFS2_I(inode)->ip_xattr_sem);

ocfs2_inode_unlock_tracker(inode, 0, , had_lock);
brelse(di_bh);
@@ -330,7 +332,9 @@ int ocfs2_acl_chmod(struct inode *inode, struct buffer_head 
*bh)
if (!(osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL))
return 0;

+   down_read(&OCFS2_I(inode)->ip_xattr_sem);
acl = ocfs2_get_acl_nolock(inode, ACL_TYPE_ACCESS, bh);
+   up_read(&OCFS2_I(inode)->ip_xattr_sem);
if (IS_ERR(acl) || !acl)
return PTR_ERR(acl);
ret = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
@@ -361,8 +365,10 @@ int ocfs2_init_acl(handle_t *handle,

if (!S_ISLNK(inode->i_mode)) {
if (osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) {
+   down_read(&OCFS2_I(dir)->ip_xattr_sem);
acl = ocfs2_get_acl_nolock(dir, ACL_TYPE_DEFAULT,
   dir_bh);
+   up_read(&OCFS2_I(dir)->ip_xattr_sem);
if (IS_ERR(acl))
return PTR_ERR(acl);
}
diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 439f567..adeebcb 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -638,9 +638,11 @@ int ocfs2_calc_xattr_init(struct inode *dir,
 si->value_len);

if (osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) {
+   down_read(&OCFS2_I(dir)->ip_xattr_sem);
acl_len = ocfs2_xattr_get_nolock(dir, dir_bh,
OCFS2_XATTR_INDEX_POSIX_ACL_DEFAULT,
"", NULL, 0);
+   up_read(&OCFS2_I(dir)->ip_xattr_sem);
if (acl_len > 0) {
a_size = ocfs2_xattr_entry_real_size(0, acl_len);
if (S_ISDIR(mode))
-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] Ocfs2-devel Digest, Vol 167, Issue 20

2018-01-12 Thread piaojun
Hi Guozhonghua,

It seems that the deadlock can be reproduced easily, right? Sharing the
lock with the VFS layer is probably risky, and introducing a new lock
for "quota_recovery" sounds good. Could you post a patch to fix this
problem?
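
To make the cycle in your call chain below concrete, here is a tiny
userspace analogue (an illustration only, not ocfs2 code): a pthread
rwlock stands in for s_umount and pthread_join() stands in for
flush_workqueue(); the program simply hangs, which is the point.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t s_umount = PTHREAD_RWLOCK_INITIALIZER;

/* plays the role of the quota recovery work queued on ocfs2_wq */
static void *quota_recovery_work(void *arg)
{
	pthread_rwlock_rdlock(&s_umount);	/* like down_read(&sb->s_umount) */
	puts("recovery ran");			/* never reached */
	pthread_rwlock_unlock(&s_umount);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	pthread_rwlock_wrlock(&s_umount);	/* umount: down_write(&s->s_umount) */
	pthread_create(&worker, NULL, quota_recovery_work, NULL);
	pthread_join(&worker, NULL);		/* like flush_workqueue(): waits forever */
	pthread_rwlock_unlock(&s_umount);
	return 0;
}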

thanks,
Jun

On 2018/1/13 11:04, Guozhonghua wrote:
> 
>> Message: 1
>> Date: Fri, 12 Jan 2018 06:15:01 +
>> From: Shichangkuo 
>> Subject: Re: [Ocfs2-devel] [Ocfs2-dev] BUG: deadlock with umount and
>>  ocfs2 workqueue triggered by ocfs2rec thread
>> To: Joseph Qi , "z...@suse.com" ,
>>  "j...@suse.cz" 
>> Cc: "ocfs2-devel@oss.oracle.com" 
>> Message-ID:
>>  > i-3com.com>
>>
>> Content-Type: text/plain; charset="gb2312"
>>
>> Hi Joseph
>> Thanks for replying.
>> Umount will flush the ocfs2 workqueue in function
>> ocfs2_truncate_log_shutdown and journal recovery is one work of ocfs2 wq.
>>
>> Thanks
>> Changkuo
>>
> 
> Umount
>   mntput
>     cleanup_mnt
>       deactivate_super: down_write the rw_semaphore: down_write(&s->s_umount)
>         deactivate_locked_super
>           kill_sb: kill_block_super
>             generic_shutdown_super
>               put_super: ocfs2_put_super
>                 ocfs2_dismount_volume
>                   ocfs2_truncate_log_shutdown
>                     flush_workqueue(osb->ocfs2_wq);
>                       ocfs2_finish_quota_recovery
>                         down_read(&sb->s_umount);
> Here it retries down_read on the rw_semaphore; taking down_read while
> already holding it for write -- trying the rw_semaphore twice, dead lock?
> The flush of the ocfs2_wq workqueue will be blocked, and so will the
> umount operation.
> 
> Thanks. 
> 
> 
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2/xattr: assign errno to 'ret' in ocfs2_calc_xattr_init()

2018-01-11 Thread piaojun
We need catch the errno returned by ocfs2_xattr_get_nolock() and assign
it to 'ret' for printing and noticing upper callers.

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/xattr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 5fdf269..439f567 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -646,6 +646,7 @@ int ocfs2_calc_xattr_init(struct inode *dir,
if (S_ISDIR(mode))
a_size <<= 1;
} else if (acl_len != 0 && acl_len != -ENODATA) {
+   ret = acl_len;
mlog_errno(ret);
return ret;
}
-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] kernel bug

2018-01-10 Thread piaojun
Hi Cédric,

You'd better paste the core dump stack and the method of reproducing
this BUG.

thanks,
Jun

On 2018/1/10 19:48, BASSAGET Cédric wrote:
> Hello
> Today I reported a bug related to ocfs2 1.8 on the proxmox forum, maybe somebody 
> here will be able to help me...
> The bug report on the proxmox forum : 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__forum.proxmox.com_threads_ocfs2-2Dkernel-2Dbug.39163_=DwID-g=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=C7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y=Cjs_sZYvlRtRCpsPU_lWPoen_rJyr14Cw3AxxedrGac=1IlEt5_VsnceBwz_3AYQq8zKNy6viF9oxQGtp8odqn4=
>  
> 
> 
> to resume : kernel BUG at fs/ocfs2/suballoc.c:2017!
> 
> is this related to proxmox kernel or ocfs2 ? or both ?
> does it have something to do with 
> https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html ?
> 
> Thanks for your help.
> Regards,
> Cédric
> 
> 
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-28 Thread piaojun
LGTM

On 2017/12/28 15:48, Gang He wrote:
> If we can't get inode lock immediately in the function
> ocfs2_inode_lock_with_page() when reading a page, we should not
> return directly here, since this will lead to a softlockup problem
> when the kernel is configured with CONFIG_PREEMPT is not set.
> The method is to get a blocking lock and immediately unlock before
> returning, this can avoid CPU resource waste due to lots of retries,
> and benefits fairness in getting lock among multiple nodes, increase
> efficiency in case modifying the same file frequently from multiple
> nodes.
> The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
> looks like,
> Kernel panic - not syncing: softlockup: hung tasks
> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Call Trace:
>   
>   dump_stack+0x5c/0x82
>   panic+0xd5/0x21e
>   watchdog_timer_fn+0x208/0x210
>   ? watchdog_park_threads+0x70/0x70
>   __hrtimer_run_queues+0xcc/0x200
>   hrtimer_interrupt+0xa6/0x1f0
>   smp_apic_timer_interrupt+0x34/0x50
>   apic_timer_interrupt+0x96/0xa0
>   
>  RIP: 0010:unlock_page+0x17/0x30
>  RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>  RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>  RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>  RBP:  R08:  R09: af154080bb00
>  R10: af154080bc30 R11: 0040 R12: 993749a39518
>  R13:  R14: f21e009f5300 R15: f21e009f5300
>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>   ? pagecache_get_page+0x30/0x200
>   filemap_fault+0x12b/0x5c0
>   ? recalc_sigpending+0x17/0x50
>   ? __set_task_blocked+0x28/0x70
>   ? __set_current_blocked+0x3d/0x60
>   ocfs2_fault+0x29/0xb0 [ocfs2]
>   __do_fault+0x1a/0xa0
>   __handle_mm_fault+0xbe8/0x1090
>   handle_mm_fault+0xaa/0x1f0
>   __do_page_fault+0x235/0x4b0
>   trace_do_page_fault+0x3c/0x110
>   async_page_fault+0x28/0x30
>  RIP: 0033:0x7fa75ded638e
>  RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>  RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>  RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>  RBP: 0003 R08: 000e R09: 
>  R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>  R13: 000e R14: 1770 R15: 
> 
> About performance improvement, we can see the testing time is reduced,
> and CPU utilization decreases, the detailed data is as follows.
> I ran multi_mmap test case in ocfs2-test package in a three nodes cluster.
> Before apply this patch,
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 
> multi_mmap
>  1505 root  rt   0  36 123060  97224 S 2.658 6.015   0:01.44 corosync
> 5 root  20   0   0  0  0 S 1.329 0.000   0:00.19 
> kworker/u8:0
>95 root  20   0   0  0  0 S 1.329 0.000   0:00.25 
> kworker/u8:1
>  2728 root  20   0   0  0  0 S 0.997 0.000   0:00.24 
> jbd2/sda1-33
>  2721 root  20   0   0  0  0 S 0.664 0.000   0:00.07 
> ocfs2dc-3C8CFD4
>  2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun
> 
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o 
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d 
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 14:44:52 CST 2017
> multi_mmap..Passed.
> Runtime 783 seconds.
> 
> After apply this patch,
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 
> multi_mmap
>   155 root  20   0   0  0  0 S 2.667 0.000   0:01.20 
> kworker/u8:3
>95 root  20   0   0  0  0 S 2.000 0.000   0:01.58 
> kworker/u8:1
>  2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
> 5 root  20   0   0  0  0 S 1.000 0.000   0:01.36 
> kworker/u8:0
>  2482 root  20   0   0  0  0 S 1.000 0.000   0:00.86 
> jbd2/sda1-33
>   299 root   0 -20   0  0  0 S 0.333 0.000   0:00.13 
> kworker/2:1H
>   335 root   0 -20   0  0  0 S 0.333 0.000   0:00.17 
> kworker/1:1H
>   535 root  20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
>  1282 root  rt   0  84 123108  97224 S 0.333 6.017   0:01.33 corosync
> 
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o 
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d 
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 15:04:12 CST 2017

Re: [Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-27 Thread piaojun
Hi Gang,

This patch looks good to me.

thanks,
Jun

On 2017/12/28 10:58, Gang He wrote:
> 
> 
> 

>> Hi Gang,
>>
>> You cleared my doubt. Should we handle the errno of ocfs2_inode_lock()
>> or just use mlog_errno()?
> Hi Jun, I think it is not necessary, since we just want to hold a while 
> before get the DLM lock,
> we do not care about the result, since we will unlock immediately here.
> In fact, this patch does NOT add new code, just revert the old patch 
> 1cce4df04f37, and add 
> more clear comments in the front of these two lines code.
> 
> Thanks
> Gang
> 
>>
>> thanks,
>> Jun
>>
>> On 2017/12/28 10:11, Gang He wrote:
>>> Hi Jun,
>>>
>>>
>>
 Hi Gang,

 Thanks for your explaination, and I just have one more question. Could
 we use 'ocfs2_inode_lock' instead of 'ocfs2_inode_lock_full' to avoid
 -EAGAIN circularly?
>>> No, please see the comments above the function  
>> ocfs2_inode_lock_with_page(),
>>> there will be probably a deadlock between tasks acquiring DLM
>>> locks while holding a page lock and the downconvert thread which
>>> blocks dlm lock acquiry while acquiring page locks.
>>> Then, the OCFS2_LOCK_NONBLOCK flag was introduced as a workaround to
>>> avoid this case.
>>>
>>> Thanks
>>> Gang
>>>

 thanks,
 Jun

 On 2017/12/27 18:37, Gang He wrote:
> Hi Jun,
>
>

>> Hi Gang,
>>
>> Do you mean that too many retrys in loop cast losts of CPU-time and
>> block page-fault interrupt? We should not add any delay in
>> ocfs2_fault(), right? And I still feel a little confused why your
>> method can solve this problem.
> You can see the related code in function filemap_fault(), if ocfs2 fails 
> to 
 read a page since 
> it can not get a inode lock with non-block mode, the VFS layer code will 
 invoke ocfs2
> read page call back function circularly, this will lead to a softlockup 
 problem (like the below back trace).
> So, we should get a blocking lock to let the dlm lock to this node and 
> also 
 can avoid CPU loop,
> second, base on my testing, the patch also can improve the efficiency in 
 case modifying the same
> file frequently from multiple nodes, since the lock acquisition chance is 
 more fair.
> In fact, the code was modified by a patch 1cce4df04f37 ("ocfs2: do not 
 lock/unlock() inode DLM lock"),
> before that patch, the code is the same, this patch can be considered to 
 revert that patch, except adding more
> clear comments.
>  
> Thanks
> Gang
>
>
>>
>> thanks,
>> Jun
>>
>> On 2017/12/27 17:29, Gang He wrote:
>>> If we can't get inode lock immediately in the function
>>> ocfs2_inode_lock_with_page() when reading a page, we should not
>>> return directly here, since this will lead to a softlockup problem.
>>> The method is to get a blocking lock and immediately unlock before
>>> returning, this can avoid CPU resource waste due to lots of retries,
>>> and benefits fairness in getting lock among multiple nodes, increase
>>> efficiency in case modifying the same file frequently from multiple
>>> nodes.
>>> The softlockup problem looks like,
>>> Kernel panic - not syncing: softlockup: hung tasks
>>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>> Call Trace:
>>>   
>>>   dump_stack+0x5c/0x82
>>>   panic+0xd5/0x21e
>>>   watchdog_timer_fn+0x208/0x210
>>>   ? watchdog_park_threads+0x70/0x70
>>>   __hrtimer_run_queues+0xcc/0x200
>>>   hrtimer_interrupt+0xa6/0x1f0
>>>   smp_apic_timer_interrupt+0x34/0x50
>>>   apic_timer_interrupt+0x96/0xa0
>>>   
>>>  RIP: 0010:unlock_page+0x17/0x30
>>>  RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>>>  RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>>>  RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>>>  RBP:  R08:  R09: af154080bb00
>>>  R10: af154080bc30 R11: 0040 R12: 993749a39518
>>>  R13:  R14: f21e009f5300 R15: f21e009f5300
>>>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>>>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>>>   ? pagecache_get_page+0x30/0x200
>>>   filemap_fault+0x12b/0x5c0
>>>   ? recalc_sigpending+0x17/0x50
>>>   ? __set_task_blocked+0x28/0x70
>>>   ? __set_current_blocked+0x3d/0x60
>>>   ocfs2_fault+0x29/0xb0 [ocfs2]
>>>   __do_fault+0x1a/0xa0
>>>   __handle_mm_fault+0xbe8/0x1090
>>>   handle_mm_fault+0xaa/0x1f0
>>>   __do_page_fault+0x235/0x4b0
>>>   trace_do_page_fault+0x3c/0x110
>>>   async_page_fault+0x28/0x30
>>>  RIP: 0033:0x7fa75ded638e
>>>  RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>>>  RAX: 55c7662fb700 RBX: 

Re: [Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-27 Thread piaojun
Hi Gang,

You cleared my doubt. Should we handle the errno of ocfs2_inode_lock()
or just use mlog_errno()?

thanks,
Jun

On 2017/12/28 10:11, Gang He wrote:
> Hi Jun,
> 
> 

>> Hi Gang,
>>
>> Thanks for your explaination, and I just have one more question. Could
>> we use 'ocfs2_inode_lock' instead of 'ocfs2_inode_lock_full' to avoid
>> -EAGAIN circularly?
> No, please see the comments above the function  ocfs2_inode_lock_with_page(),
> there will be probably a deadlock between tasks acquiring DLM
> locks while holding a page lock and the downconvert thread which
> blocks dlm lock acquiry while acquiring page locks.
> Then, the OCFS2_LOCK_NONBLOCK flag was introduced as a workaround to
> avoid this case.
> 
> Thanks
> Gang
> 
>>
>> thanks,
>> Jun
>>
>> On 2017/12/27 18:37, Gang He wrote:
>>> Hi Jun,
>>>
>>>
>>
 Hi Gang,

 Do you mean that too many retrys in loop cast losts of CPU-time and
 block page-fault interrupt? We should not add any delay in
 ocfs2_fault(), right? And I still feel a little confused why your
 method can solve this problem.
>>> You can see the related code in function filemap_fault(), if ocfs2 fails to 
>> read a page since 
>>> it can not get a inode lock with non-block mode, the VFS layer code will 
>> invoke ocfs2
>>> read page call back function circularly, this will lead to a softlockup 
>> problem (like the below back trace).
>>> So, we should get a blocking lock to let the dlm lock to this node and also 
>> can avoid CPU loop,
>>> second, base on my testing, the patch also can improve the efficiency in 
>> case modifying the same
>>> file frequently from multiple nodes, since the lock acquisition chance is 
>> more fair.
>>> In fact, the code was modified by a patch 1cce4df04f37 ("ocfs2: do not 
>> lock/unlock() inode DLM lock"),
>>> before that patch, the code is the same, this patch can be considered to 
>> revert that patch, except adding more
>>> clear comments.
>>>  
>>> Thanks
>>> Gang
>>>
>>>

 thanks,
 Jun

 On 2017/12/27 17:29, Gang He wrote:
> If we can't get inode lock immediately in the function
> ocfs2_inode_lock_with_page() when reading a page, we should not
> return directly here, since this will lead to a softlockup problem.
> The method is to get a blocking lock and immediately unlock before
> returning, this can avoid CPU resource waste due to lots of retries,
> and benefits fairness in getting lock among multiple nodes, increase
> efficiency in case modifying the same file frequently from multiple
> nodes.
> The softlockup problem looks like,
> Kernel panic - not syncing: softlockup: hung tasks
> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Call Trace:
>   
>   dump_stack+0x5c/0x82
>   panic+0xd5/0x21e
>   watchdog_timer_fn+0x208/0x210
>   ? watchdog_park_threads+0x70/0x70
>   __hrtimer_run_queues+0xcc/0x200
>   hrtimer_interrupt+0xa6/0x1f0
>   smp_apic_timer_interrupt+0x34/0x50
>   apic_timer_interrupt+0x96/0xa0
>   
>  RIP: 0010:unlock_page+0x17/0x30
>  RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>  RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>  RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>  RBP:  R08:  R09: af154080bb00
>  R10: af154080bc30 R11: 0040 R12: 993749a39518
>  R13:  R14: f21e009f5300 R15: f21e009f5300
>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>   ? pagecache_get_page+0x30/0x200
>   filemap_fault+0x12b/0x5c0
>   ? recalc_sigpending+0x17/0x50
>   ? __set_task_blocked+0x28/0x70
>   ? __set_current_blocked+0x3d/0x60
>   ocfs2_fault+0x29/0xb0 [ocfs2]
>   __do_fault+0x1a/0xa0
>   __handle_mm_fault+0xbe8/0x1090
>   handle_mm_fault+0xaa/0x1f0
>   __do_page_fault+0x235/0x4b0
>   trace_do_page_fault+0x3c/0x110
>   async_page_fault+0x28/0x30
>  RIP: 0033:0x7fa75ded638e
>  RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>  RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>  RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>  RBP: 0003 R08: 000e R09: 
>  R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>  R13: 000e R14: 1770 R15: 
>
> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
> Signed-off-by: Gang He 
> ---
>  fs/ocfs2/dlmglue.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 4689940..5193218 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ 

Re: [Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-27 Thread piaojun
Hi Gang,

Thanks for your explanation, and I just have one more question. Could
we use 'ocfs2_inode_lock' instead of 'ocfs2_inode_lock_full' to avoid
getting -EAGAIN repeatedly?

thanks,
Jun

On 2017/12/27 18:37, Gang He wrote:
> Hi Jun,
> 
> 

>> Hi Gang,
>>
>> Do you mean that too many retrys in loop cast losts of CPU-time and
>> block page-fault interrupt? We should not add any delay in
>> ocfs2_fault(), right? And I still feel a little confused why your
>> method can solve this problem.
> You can see the related code in function filemap_fault(), if ocfs2 fails to 
> read a page since 
> it can not get a inode lock with non-block mode, the VFS layer code will 
> invoke ocfs2
> read page call back function circularly, this will lead to a softlockup 
> problem (like the below back trace).
> So, we should get a blocking lock to let the dlm lock to this node and also 
> can avoid CPU loop,
> second, base on my testing, the patch also can improve the efficiency in case 
> modifying the same
> file frequently from multiple nodes, since the lock acquisition chance is 
> more fair.
> In fact, the code was modified by a patch 1cce4df04f37 ("ocfs2: do not 
> lock/unlock() inode DLM lock"),
> before that patch, the code is the same, this patch can be considered to 
> revert that patch, except adding more
> clear comments.
>  
> Thanks
> Gang
> 
> 
>>
>> thanks,
>> Jun
>>
>> On 2017/12/27 17:29, Gang He wrote:
>>> If we can't get inode lock immediately in the function
>>> ocfs2_inode_lock_with_page() when reading a page, we should not
>>> return directly here, since this will lead to a softlockup problem.
>>> The method is to get a blocking lock and immediately unlock before
>>> returning, this can avoid CPU resource waste due to lots of retries,
>>> and benefits fairness in getting lock among multiple nodes, increase
>>> efficiency in case modifying the same file frequently from multiple
>>> nodes.
>>> The softlockup problem looks like,
>>> Kernel panic - not syncing: softlockup: hung tasks
>>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>> Call Trace:
>>>   
>>>   dump_stack+0x5c/0x82
>>>   panic+0xd5/0x21e
>>>   watchdog_timer_fn+0x208/0x210
>>>   ? watchdog_park_threads+0x70/0x70
>>>   __hrtimer_run_queues+0xcc/0x200
>>>   hrtimer_interrupt+0xa6/0x1f0
>>>   smp_apic_timer_interrupt+0x34/0x50
>>>   apic_timer_interrupt+0x96/0xa0
>>>   
>>>  RIP: 0010:unlock_page+0x17/0x30
>>>  RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>>>  RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>>>  RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>>>  RBP:  R08:  R09: af154080bb00
>>>  R10: af154080bc30 R11: 0040 R12: 993749a39518
>>>  R13:  R14: f21e009f5300 R15: f21e009f5300
>>>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>>>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>>>   ? pagecache_get_page+0x30/0x200
>>>   filemap_fault+0x12b/0x5c0
>>>   ? recalc_sigpending+0x17/0x50
>>>   ? __set_task_blocked+0x28/0x70
>>>   ? __set_current_blocked+0x3d/0x60
>>>   ocfs2_fault+0x29/0xb0 [ocfs2]
>>>   __do_fault+0x1a/0xa0
>>>   __handle_mm_fault+0xbe8/0x1090
>>>   handle_mm_fault+0xaa/0x1f0
>>>   __do_page_fault+0x235/0x4b0
>>>   trace_do_page_fault+0x3c/0x110
>>>   async_page_fault+0x28/0x30
>>>  RIP: 0033:0x7fa75ded638e
>>>  RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>>>  RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>>>  RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>>>  RBP: 0003 R08: 000e R09: 
>>>  R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>>>  R13: 000e R14: 1770 R15: 
>>>
>>> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
>>> Signed-off-by: Gang He 
>>> ---
>>>  fs/ocfs2/dlmglue.c | 9 +
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>> index 4689940..5193218 100644
>>> --- a/fs/ocfs2/dlmglue.c
>>> +++ b/fs/ocfs2/dlmglue.c
>>> @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>> ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>> if (ret == -EAGAIN) {
>>> unlock_page(page);
>>> +   /*
>>> +* If we can't get inode lock immediately, we should not return
>>> +* directly here, since this will lead to a softlockup problem.
>>> +* The method is to get a blocking lock and immediately unlock
>>> +* before returning, this can avoid CPU resource waste due to
>>> +* lots of retries, and benefits fairness in getting lock.
>>> +*/
>>> +   if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>>> +   ocfs2_inode_unlock(inode, ex);
>>>   

Re: [Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-27 Thread piaojun
Hi Gang,

Do you mean that too many retries in the loop cost lots of CPU time and
block the page-fault path? We should not add any delay in
ocfs2_fault(), right? And I still feel a little confused about why your
method can solve this problem.

thanks,
Jun

On 2017/12/27 17:29, Gang He wrote:
> If we can't get inode lock immediately in the function
> ocfs2_inode_lock_with_page() when reading a page, we should not
> return directly here, since this will lead to a softlockup problem.
> The method is to get a blocking lock and immediately unlock before
> returning, this can avoid CPU resource waste due to lots of retries,
> and benefits fairness in getting lock among multiple nodes, increase
> efficiency in case modifying the same file frequently from multiple
> nodes.
> The softlockup problem looks like,
> Kernel panic - not syncing: softlockup: hung tasks
> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Call Trace:
>   
>   dump_stack+0x5c/0x82
>   panic+0xd5/0x21e
>   watchdog_timer_fn+0x208/0x210
>   ? watchdog_park_threads+0x70/0x70
>   __hrtimer_run_queues+0xcc/0x200
>   hrtimer_interrupt+0xa6/0x1f0
>   smp_apic_timer_interrupt+0x34/0x50
>   apic_timer_interrupt+0x96/0xa0
>   
>  RIP: 0010:unlock_page+0x17/0x30
>  RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>  RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>  RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>  RBP:  R08:  R09: af154080bb00
>  R10: af154080bc30 R11: 0040 R12: 993749a39518
>  R13:  R14: f21e009f5300 R15: f21e009f5300
>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>   ? pagecache_get_page+0x30/0x200
>   filemap_fault+0x12b/0x5c0
>   ? recalc_sigpending+0x17/0x50
>   ? __set_task_blocked+0x28/0x70
>   ? __set_current_blocked+0x3d/0x60
>   ocfs2_fault+0x29/0xb0 [ocfs2]
>   __do_fault+0x1a/0xa0
>   __handle_mm_fault+0xbe8/0x1090
>   handle_mm_fault+0xaa/0x1f0
>   __do_page_fault+0x235/0x4b0
>   trace_do_page_fault+0x3c/0x110
>   async_page_fault+0x28/0x30
>  RIP: 0033:0x7fa75ded638e
>  RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>  RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>  RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>  RBP: 0003 R08: 000e R09: 
>  R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>  R13: 000e R14: 1770 R15: 
> 
> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
> Signed-off-by: Gang He 
> ---
>  fs/ocfs2/dlmglue.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 4689940..5193218 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>   ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>   if (ret == -EAGAIN) {
>   unlock_page(page);
> + /*
> +  * If we can't get inode lock immediately, we should not return
> +  * directly here, since this will lead to a softlockup problem.
> +  * The method is to get a blocking lock and immediately unlock
> +  * before returning, this can avoid CPU resource waste due to
> +  * lots of retries, and benefits fairness in getting lock.
> +  */
> + if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> + ocfs2_inode_unlock(inode, ex);
>   ret = AOP_TRUNCATED_PAGE;
>   }
>  
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: try to reuse extent block in dealloc without meta_alloc

2017-12-27 Thread piaojun
Hi Changwei,

On 2017/12/26 15:55, Changwei Ge wrote:
> A crash issue was reported by John.
> The call trace follows:
> ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2]
> ocfs2_change_extent_flag+0x33a/0x470 [ocfs2]
> ocfs2_mark_extent_written+0x172/0x220 [ocfs2]
> ocfs2_dio_end_io+0x62d/0x910 [ocfs2]
> dio_complete+0x19a/0x1a0
> do_blockdev_direct_IO+0x19dd/0x1eb0
> __blockdev_direct_IO+0x43/0x50
> ocfs2_direct_IO+0x8f/0xa0 [ocfs2]
> generic_file_direct_write+0xb2/0x170
> __generic_file_write_iter+0xc3/0x1b0
> ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2]
> __vfs_write+0xae/0xf0
> vfs_write+0xb8/0x1b0
> SyS_write+0x4f/0xb0
> system_call_fastpath+0x16/0x75
> 
> The BUG code told that extent tree wants to grow but no metadata
> was reserved ahead of time.
>  From my investigation into this issue, the root cause it that although
> enough metadata is not reserved, there should be enough for following use.
> Rightmost extent is merged into its left one due to a certain times of
> marking extent written. Because during marking extent written, we got many
> physically continuous extents. At last, an empty extent showed up and the
> rightmost path is removed from extent tree.
> 
> Add a new mechanism to reuse extent block cached in dealloc which were
> just unlinked from extent tree to solve this crash issue.
> 
> Criteria is that during marking extents *written*, if extent rotation
> and merging results in unlinking extent with growing extent tree later
> without any metadata reserved ahead of time, try to reuse those extents
> in dealloc in which deleted extents are cached.
> 
> Also, this patch addresses the issue John reported that ::dw_zero_count is
> not calculated properly.

Does your patch solve two different problems? If so, I suggest
splitting it into two patches.

> 
> After applying this patch, the issue John reported was gone.
> Thanks for the reproducer provided by John.
> And this patch has passed ocfs2-test(29 cases) suite running by New H3C Group.
> 
> Reported-by: John Lightsey 
> Signed-off-by: Changwei Ge 
> Reviewed-by: Duan Zhang 
> ---
>   fs/ocfs2/alloc.c | 140 
> ---
>   fs/ocfs2/alloc.h |   1 +
>   fs/ocfs2/aops.c  |  14 --
>   3 files changed, 145 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> index ab5105f..56aba96 100644
> --- a/fs/ocfs2/alloc.c
> +++ b/fs/ocfs2/alloc.c
> @@ -165,6 +165,13 @@ static int ocfs2_dinode_insert_check(struct 
> ocfs2_extent_tree *et,
>struct ocfs2_extent_rec *rec);
>   static int ocfs2_dinode_sanity_check(struct ocfs2_extent_tree *et);
>   static void ocfs2_dinode_fill_root_el(struct ocfs2_extent_tree *et);
> +
> +static int ocfs2_reuse_blk_from_dealloc(handle_t *handle,
> + struct ocfs2_extent_tree *et,
> + struct buffer_head **new_eb_bh,
> + int blk_cnt);
> +static int ocfs2_is_dealloc_empty(struct ocfs2_extent_tree *et);
> +
>   static const struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
>   .eo_set_last_eb_blk = ocfs2_dinode_set_last_eb_blk,
>   .eo_get_last_eb_blk = ocfs2_dinode_get_last_eb_blk,
> @@ -448,6 +455,7 @@ static void __ocfs2_init_extent_tree(struct 
> ocfs2_extent_tree *et,
>   if (!obj)
>   obj = (void *)bh->b_data;
>   et->et_object = obj;
> + et->et_dealloc = NULL;

I wonder if it is necessary to set it to NULL here, as we will reassign
it later.

thanks,
Jun

>   
>   et->et_ops->eo_fill_root_el(et);
>   if (!et->et_ops->eo_fill_max_leaf_clusters)
> @@ -1213,8 +1221,15 @@ static int ocfs2_add_branch(handle_t *handle,
>   goto bail;
>   }
>   
> - status = ocfs2_create_new_meta_bhs(handle, et, new_blocks,
> -meta_ac, new_eb_bhs);
> + if (meta_ac) {
> + status = ocfs2_create_new_meta_bhs(handle, et, new_blocks,
> +meta_ac, new_eb_bhs);
> + } else if (!ocfs2_is_dealloc_empty(et)) {
> + status = ocfs2_reuse_blk_from_dealloc(handle, et,
> +   new_eb_bhs, new_blocks);
> + } else {
> + BUG();
> + }
>   if (status < 0) {
>   mlog_errno(status);
>   goto bail;
> @@ -1347,8 +1362,15 @@ static int ocfs2_shift_tree_depth(handle_t *handle,
>   struct ocfs2_extent_list  *root_el;
>   struct ocfs2_extent_list  *eb_el;
>   
> - status = ocfs2_create_new_meta_bhs(handle, et, 1, meta_ac,
> -    &new_eb_bh);
> + if (meta_ac) {
> + status = ocfs2_create_new_meta_bhs(handle, et, 1, meta_ac,
> +    &new_eb_bh);
> + } else if (!ocfs2_is_dealloc_empty(et)) {
> + status = 

[Ocfs2-devel] [PATCH v3] ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid

2017-12-26 Thread piaojun
If metadata is corrupted, such as an invalid inode block, mount() will
fail and the filesystem will be set read-only, as below:

ocfs2_mount
  ocfs2_initialize_super
ocfs2_init_global_system_inodes
  ocfs2_iget
ocfs2_read_locked_inode
  ocfs2_validate_inode_block
ocfs2_error
  ocfs2_handle_error
ocfs2_set_ro_flag(osb, 0);  // set readonly

In this situation we need to return -EROFS to 'mount.ocfs2', so that the
user can fix it with fsck and then mount again. In addition,
'mount.ocfs2' should be updated correspondingly, as it currently returns
1 for every errno. I will post a patch for 'mount.ocfs2' too.
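
For illustration only, a hypothetical snippet of what a caller such as
mount.ocfs2 could do once the kernel returns a distinct errno (this is
not the actual mount.ocfs2 code; that change will be posted separately):

#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

static int try_mount(const char *dev, const char *dir)
{
	if (mount(dev, dir, "ocfs2", 0, NULL) == 0)
		return 0;

	if (errno == EROFS)
		fprintf(stderr, "metadata looks corrupt, run fsck.ocfs2 on %s "
			"and mount again\n", dev);
	else
		perror("mount");

	return errno;
}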

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
Reviewed-by: Joseph Qi 
Reviewed-by: Changwei Ge 
Reviewed-by: Gang He 
---
 fs/ocfs2/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 040bbb6..4e4bb27 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -474,9 +474,8 @@ static int ocfs2_init_global_system_inodes(struct 
ocfs2_super *osb)
new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
if (!new) {
ocfs2_release_system_inodes(osb);
-   status = -EINVAL;
+   status = ocfs2_is_soft_readonly(osb) ? -EROFS : -EINVAL;
mlog_errno(status);
-   /* FIXME: Should ERROR_RO_FS */
mlog(ML_ERROR, "Unable to load system inode %d, "
 "possibly corrupt fs?", i);
goto bail;
@@ -505,7 +504,7 @@ static int ocfs2_init_local_system_inodes(struct 
ocfs2_super *osb)
new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
if (!new) {
ocfs2_release_system_inodes(osb);
-   status = -EINVAL;
+   status = ocfs2_is_soft_readonly(osb) ? -EROFS : -EINVAL;
mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
 status, i, osb->slot_num);
goto bail;
-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid

2017-12-26 Thread piaojun
Hi Changwei,

On 2017/12/26 19:18, Changwei Ge wrote:
> Hi Jun,
> 
> On 2017/12/26 19:10, piaojun wrote:
>> If metadata is corrupted such as 'invalid inode block', we will get
>> failed by calling 'mount()' and then set filesystem readonly as below:
>>
>> ocfs2_mount
>>ocfs2_initialize_super
>>  ocfs2_init_global_system_inodes
>>ocfs2_iget
>>  ocfs2_read_locked_inode
>>ocfs2_validate_inode_block
>>  ocfs2_error
>>ocfs2_handle_error
>>  ocfs2_set_ro_flag(osb, 0);  // set readonly
>>
>> In this situation we need return -EROFS to 'mount.ocfs2', so that user
>> can fix it by fsck and mount again rather than doing meaningless retry.
>> In addition, 'mount.ocfs2' should be updated correspondingly as it only
>> return 1 for all errno. And I will post a patch for 'mount.ocfs2' too.
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> ---
>>   fs/ocfs2/super.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>> index 040bbb6..f46f6a6 100644
>> --- a/fs/ocfs2/super.c
>> +++ b/fs/ocfs2/super.c
>> @@ -474,7 +474,7 @@ static int ocfs2_init_global_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +status = ocfs2_is_soft_readonly(osb) ? -EROFS : -EINVAL;
>>  mlog_errno(status);
>>  /* FIXME: Should ERROR_RO_FS */
> 
> If your patch is applied, shall we remove above comment line?
> 
> Thanks,
> Changwei
> 
Good suggestion! If nobody rejects it, I will remove that comment.

thanks,
Jun

>>  mlog(ML_ERROR, "Unable to load system inode %d, "
>> @@ -505,7 +505,7 @@ static int ocfs2_init_local_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +status = ocfs2_is_soft_readonly(osb) ? -EROFS : -EINVAL;
>>  mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
>>   status, i, osb->slot_num);
>>  goto bail;
>>
> 
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH v2] ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid

2017-12-26 Thread piaojun
If metadata is corrupted, such as an invalid inode block, mount() will
fail and the filesystem will be set read-only, as below:

ocfs2_mount
  ocfs2_initialize_super
ocfs2_init_global_system_inodes
  ocfs2_iget
ocfs2_read_locked_inode
  ocfs2_validate_inode_block
ocfs2_error
  ocfs2_handle_error
ocfs2_set_ro_flag(osb, 0);  // set readonly

In this situation we need to return -EROFS to 'mount.ocfs2', so that the
user can fix it with fsck and mount again rather than retrying
pointlessly. In addition, 'mount.ocfs2' should be updated
correspondingly, as it currently returns 1 for every errno. I will post
a patch for 'mount.ocfs2' too.

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
---
 fs/ocfs2/super.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 040bbb6..f46f6a6 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -474,7 +474,7 @@ static int ocfs2_init_global_system_inodes(struct 
ocfs2_super *osb)
new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
if (!new) {
ocfs2_release_system_inodes(osb);
-   status = -EINVAL;
+   status = ocfs2_is_soft_readonly(osb) ? -EROFS : -EINVAL;
mlog_errno(status);
/* FIXME: Should ERROR_RO_FS */
mlog(ML_ERROR, "Unable to load system inode %d, "
@@ -505,7 +505,7 @@ static int ocfs2_init_local_system_inodes(struct 
ocfs2_super *osb)
new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
if (!new) {
ocfs2_release_system_inodes(osb);
-   status = -EINVAL;
+   status = ocfs2_is_soft_readonly(osb) ? -EROFS : -EINVAL;
mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
 status, i, osb->slot_num);
goto bail;
-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: fall back to buffer IO when append dio is disabled with file hole existing

2017-12-26 Thread piaojun


On 2017/12/26 16:19, alex chen wrote:
> Hi Changwei,
> 
> On 2017/12/26 15:03, Changwei Ge wrote:
>> The intention of this patch is to provide an option to ocfs2 users whether
>> to allocate disk space while doing dio write.
>>
>> Whether to make ocfs2 fall back to buffer io is up to ocfs2 users through
>> toggling append-dio feature. It rather makes ocfs2 configurable and
>> flexible.
>>
> It is too strange to make ocfs2 fall back to buffer io by toggling append-dio 
> feature.
> 
>> So if something bad happens to dio write with space allocation, we can
>> still make ocfs2 fall back to buffer io. It's an option not a mandatory
>> action.:)
> Now the ocfs2 supports fill holes during direct io whether or not supporting 
> append-dio feature
> and we can directly fix the problem.
> I think it is meaningless to provide an temporary option to turn off it.
> 
I have the same opinion as Alex. And probably the better way is to
provide a new feature for users, not just reuse 'append-dio'.
>>
>> Besides, append-dio feature is key to whether to allocate space with dio
>> writing. So writing to file hole and enlarging file(appending file) should
>> have the same reflection to append-dio feature.
>>
>> Signed-off-by: Changwei Ge 
>> ---
>>   fs/ocfs2/aops.c | 53 ++---
>>   1 file changed, 50 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index d151632..32e60c0 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
>> @@ -2414,12 +2414,52 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
>>  return ret;
>>   }
>>   
>> +/*
>> + * Will look for holes and unwritten extents in the range starting at
>> + * pos for count bytes (inclusive).
>> + * Return value 1 indicates hole exists, 0 not exists, others indicate 
>> error.
>> + */
>> +static int ocfs2_range_has_holes(struct inode *inode, loff_t pos,
>> + size_t count)
>> +{
>> +int ret = 0;
>> +unsigned int extent_flags;
>> +u32 cpos, clusters, extent_len, phys_cpos;
>> +struct super_block *sb = inode->i_sb;
>> +
>> +cpos = pos >> OCFS2_SB(sb)->s_clustersize_bits;
>> +clusters = ocfs2_clusters_for_bytes(sb, pos + count) - cpos;
>> +
>> +while (clusters) {
>> +ret = ocfs2_get_clusters(inode, cpos, &phys_cpos, &extent_len,
>> + &extent_flags);
>> +if (ret < 0) {
>> +mlog_errno(ret);
>> +goto out;
>> +}
>> +
>> +if (phys_cpos == 0) {
>> +ret = 1;
>> +goto out;
>> +}
>> +
>> +if (extent_len > clusters)
>> +extent_len = clusters;
>> +
>> +clusters -= extent_len;
>> +cpos += extent_len;
>> +}
>> +out:
>> +return ret;
>> +}
>> +
>>   static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>   {
>>  struct file *file = iocb->ki_filp;
>>  struct inode *inode = file->f_mapping->host;
>>  struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>  get_block_t *get_block;
>> +int ret;
>>   
>>  /*
>>   * Fallback to buffered I/O if we see an inode without
>> @@ -2429,9 +2469,16 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, 
>> struct iov_iter *iter)
>>  return 0;
>>   
>>  /* Fallback to buffered I/O if we do not support append dio. */
>> -if (iocb->ki_pos + iter->count > i_size_read(inode) &&
>> -!ocfs2_supports_append_dio(osb))
>> -return 0;
>> +if (!ocfs2_supports_append_dio(osb)) {
>> +if (iocb->ki_pos + iter->count > i_size_read(inode))
>> +return 0;
>> +
>> +ret = ocfs2_range_has_holes(inode, iocb->ki_pos, iter->count);
>> +if (ret == 1)
>> +return 0;
>> +else if (ret < 0)
>> +return ret;
>> +}
>>   
>>  if (iov_iter_rw(iter) == READ)
>>  get_block = ocfs2_lock_get_block;
>>
> 
> 
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: return -EROFS to upper if inode block is invalid

2017-12-25 Thread piaojun
Hi Joseph,

On 2017/12/26 14:59, Joseph Qi wrote:
> 
> 
> On 17/12/26 14:45, piaojun wrote:
>> Hi Joseph,
>>
>> On 2017/12/26 14:10, Joseph Qi wrote:
>>>
>>>
>>> On 17/12/26 13:35, piaojun wrote:
>>>> Hi Joseph,
>>>>
>>>> On 2017/12/26 11:05, Joseph Qi wrote:
>>>>>
>>>>>
>>>>> On 17/12/26 10:11, piaojun wrote:
>>>>>> If metadata is corrupted such as 'invalid inode block', we will get
>>>>>> failed by calling 'mount()' as below:
>>>>>>
>>>>>> ocfs2_mount
>>>>>>   ocfs2_initialize_super
>>>>>> ocfs2_init_global_system_inodes : return -EINVAL if inode is NULL
>>>>>>   ocfs2_get_system_file_inode
>>>>>> _ocfs2_get_system_file_inode : return NULL if inode is errno
>>>>> Do you mean inode is bad?
>>>>>
>>>> Here we have to face two abnormal cases:
>>>> 1. inode is bad;
>>>> 2. read inode from disk failed due to bad storage link.
>>>>>>   ocfs2_iget
>>>>>> ocfs2_read_locked_inode
>>>>>>   ocfs2_validate_inode_block
>>>>>>
>>>>>> In this situation we need return -EROFS to upper application, so that
>>>>>> user can fix it by fsck. And then mount again.
>>>>>>
>>>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>>>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>>>>>> ---
>>>>>>  fs/ocfs2/super.c | 10 --
>>>>>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>>>>>> index 040bbb6..dea21a7 100644
>>>>>> --- a/fs/ocfs2/super.c
>>>>>> +++ b/fs/ocfs2/super.c
>>>>>> @@ -474,7 +474,10 @@ static int ocfs2_init_global_system_inodes(struct 
>>>>>> ocfs2_super *osb)
>>>>>>  new = ocfs2_get_system_file_inode(osb, i, 
>>>>>> osb->slot_num);
>>>>>>  if (!new) {
>>>>>>  ocfs2_release_system_inodes(osb);
>>>>>> -status = -EINVAL;
>>>>>> +if (ocfs2_is_soft_readonly(osb))
>>>>> I'm afraid that having bad inode doesn't means ocfs2 is readonly.
>>>>> And the calling application is mount.ocfs2. So do you mean mount.ocfs2
>>>>> have to handle EROFS like printing corresponding error log?
>>>>>
>>>> I agree that 'bad inode' also means other abnormal cases like
>>>> 'bad storage link' or 'no memory', but we can distinguish that by
>>>> ocfs2_is_soft_readonly(). I found that 'mount.ocfs2' did not
>>>> distinguish any error type and just return 1 for all error cases. I
>>>> wonder if we should return the exact errno for users?
>>>> Soft readonly is an in-memory status. The case you described is just
>>> trying to read inode and then check if it is bad. So where to set the
>>> status before?
>>>
>> we set readonly status in the following process:
>> ocfs2_validate_inode_block()
>>   ocfs2_error
>> ocfs2_handle_error
>>   ocfs2_set_ro_flag(osb, 0);
>>
>> I have a suggestion that we could distinguish readonly status in
>> 'mount.ocfs2', and return -EROFS to users so that they can fix it.
> IC. Please update this information to patch description as well.
> And suggest just use ternary operator instead of if/else.
> BTW, so mount.ocfs2 should be updated correspondingly, right?
> 
> Thanks,
> Joseph

Thanks for your advices, and I will post a patch for mount.ocfs2
correspondingly.

>>>> thanks,
>>>> Jun
>>>>
>>>>>> +status = -EROFS;
>>>>>> +else
>>>>>> +status = -EINVAL;
>>>>>>  mlog_errno(status);
>>>>>>  /* FIXME: Should ERROR_RO_FS */
>>>>>>  mlog(ML_ERROR, "Unable to load system inode %d, 
>>>>>> "
>>>>>> @@ -505,7 +508,10 @@ static int ocfs2_init_local_system_inodes(struct 
>>>>>> ocfs2_super *osb)
>>>>>>  new = ocfs2_get_system_file_inode(osb, i, 
>>>>>> osb->slot_num);
>>>>>>  if (!new) {
>>>>>>  ocfs2_release_system_inodes(osb);
>>>>>> -status = -EINVAL;
>>>>>> +if (ocfs2_is_soft_readonly(osb))
>>>>>> +status = -EROFS;
>>>>>> +else
>>>>>> +status = -EINVAL;
>>>>>>  mlog(ML_ERROR, "status=%d, sysfile=%d, 
>>>>>> slot=%d\n",
>>>>>>   status, i, osb->slot_num);
>>>>>>  goto bail;
>>>>>>
>>>>> .
>>>>>
>>> .
>>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: return -EROFS to upper if inode block is invalid

2017-12-25 Thread piaojun
Hi Joseph,

On 2017/12/26 14:10, Joseph Qi wrote:
> 
> 
> On 17/12/26 13:35, piaojun wrote:
>> Hi Joseph,
>>
>> On 2017/12/26 11:05, Joseph Qi wrote:
>>>
>>>
>>> On 17/12/26 10:11, piaojun wrote:
>>>> If metadata is corrupted such as 'invalid inode block', we will get
>>>> failed by calling 'mount()' as below:
>>>>
>>>> ocfs2_mount
>>>>   ocfs2_initialize_super
>>>> ocfs2_init_global_system_inodes : return -EINVAL if inode is NULL
>>>>   ocfs2_get_system_file_inode
>>>> _ocfs2_get_system_file_inode : return NULL if inode is errno
>>> Do you mean inode is bad?
>>>
>> Here we have to face two abnormal cases:
>> 1. inode is bad;
>> 2. read inode from disk failed due to bad storage link.
>>>>   ocfs2_iget
>>>> ocfs2_read_locked_inode
>>>>   ocfs2_validate_inode_block
>>>>
>>>> In this situation we need return -EROFS to upper application, so that
>>>> user can fix it by fsck. And then mount again.
>>>>
>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>>>> ---
>>>>  fs/ocfs2/super.c | 10 --
>>>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>>>> index 040bbb6..dea21a7 100644
>>>> --- a/fs/ocfs2/super.c
>>>> +++ b/fs/ocfs2/super.c
>>>> @@ -474,7 +474,10 @@ static int ocfs2_init_global_system_inodes(struct 
>>>> ocfs2_super *osb)
>>>>new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>>>if (!new) {
>>>>ocfs2_release_system_inodes(osb);
>>>> -  status = -EINVAL;
>>>> +  if (ocfs2_is_soft_readonly(osb))
>>> I'm afraid that having bad inode doesn't means ocfs2 is readonly.
>>> And the calling application is mount.ocfs2. So do you mean mount.ocfs2
>>> have to handle EROFS like printing corresponding error log?
>>>
>> I agree that 'bad inode' also means other abnormal cases like
>> 'bad storage link' or 'no memory', but we can distinguish that by
>> ocfs2_is_soft_readonly(). I found that 'mount.ocfs2' did not
>> distinguish any error type and just return 1 for all error cases. I
>> wonder if we should return the exact errno for users?
>> Soft readonly is an in-memory status. The case you described is just
> trying to read inode and then check if it is bad. So where to set the
> status before?
> 
we set readonly status in the following process:
ocfs2_validate_inode_block()
  ocfs2_error
ocfs2_handle_error
  ocfs2_set_ro_flag(osb, 0);

I have a suggestion that we could distinguish readonly status in
'mount.ocfs2', and return -EROFS to users so that they can fix it.
>> thanks,
>> Jun
>>
>>>> +  status = -EROFS;
>>>> +  else
>>>> +  status = -EINVAL;
>>>>mlog_errno(status);
>>>>/* FIXME: Should ERROR_RO_FS */
>>>>mlog(ML_ERROR, "Unable to load system inode %d, "
>>>> @@ -505,7 +508,10 @@ static int ocfs2_init_local_system_inodes(struct 
>>>> ocfs2_super *osb)
>>>>new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>>>if (!new) {
>>>>ocfs2_release_system_inodes(osb);
>>>> -  status = -EINVAL;
>>>> +  if (ocfs2_is_soft_readonly(osb))
>>>> +  status = -EROFS;
>>>> +  else
>>>> +  status = -EINVAL;
>>>>mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
>>>> status, i, osb->slot_num);
>>>>goto bail;
>>>>
>>> .
>>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: return -EROFS to upper if inode block is invalid

2017-12-25 Thread piaojun
Hi Joseph,

On 2017/12/26 11:05, Joseph Qi wrote:
> 
> 
> On 17/12/26 10:11, piaojun wrote:
>> If metadata is corrupted such as 'invalid inode block', we will get
>> failed by calling 'mount()' as below:
>>
>> ocfs2_mount
>>   ocfs2_initialize_super
>> ocfs2_init_global_system_inodes : return -EINVAL if inode is NULL
>>   ocfs2_get_system_file_inode
>> _ocfs2_get_system_file_inode : return NULL if inode is errno
> Do you mean inode is bad?
> 
Here we have to handle two abnormal cases:
1. the inode is bad;
2. reading the inode from disk failed due to a bad storage link.
>>   ocfs2_iget
>> ocfs2_read_locked_inode
>>   ocfs2_validate_inode_block
>>
>> In this situation we need return -EROFS to upper application, so that
>> user can fix it by fsck. And then mount again.
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> ---
>>  fs/ocfs2/super.c | 10 --
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>> index 040bbb6..dea21a7 100644
>> --- a/fs/ocfs2/super.c
>> +++ b/fs/ocfs2/super.c
>> @@ -474,7 +474,10 @@ static int ocfs2_init_global_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +if (ocfs2_is_soft_readonly(osb))
> I'm afraid that having bad inode doesn't means ocfs2 is readonly.
> And the calling application is mount.ocfs2. So do you mean mount.ocfs2
> have to handle EROFS like printing corresponding error log?
> 
I agree that a 'bad inode' can also mean other abnormal cases like a
bad storage link or no memory, but we can distinguish them by
ocfs2_is_soft_readonly(). I found that 'mount.ocfs2' does not
distinguish any error type and just returns 1 for all error cases. I
wonder if we should return the exact errno to users?

thanks,
Jun

>> +status = -EROFS;
>> +else
>> +status = -EINVAL;
>>  mlog_errno(status);
>>  /* FIXME: Should ERROR_RO_FS */
>>  mlog(ML_ERROR, "Unable to load system inode %d, "
>> @@ -505,7 +508,10 @@ static int ocfs2_init_local_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +if (ocfs2_is_soft_readonly(osb))
>> +status = -EROFS;
>> +else
>> +status = -EINVAL;
>>  mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
>>   status, i, osb->slot_num);
>>  goto bail;
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: return -EROFS to upper if inode block is invalid

2017-12-25 Thread piaojun
Hi Changwei,

I just want to return the exact errno to users so that they can fix the
read-only problem rather than retrying pointlessly.

thanks,
Jun

On 2017/12/26 11:34, Changwei Ge wrote:
> Hi Jun,
> 
> What I concern is if we don't return -EROFS to mount.ocfs2, what bad result 
> will come?
> This patch is a bug fix or something else?
> Can you elaborate your intention of this patch?
> 
> Thanks,
> Changwei
> 
> On 2017/12/26 10:14, piaojun wrote:
>> If metadata is corrupted such as 'invalid inode block', we will get
>> failed by calling 'mount()' as below:
>>
>> ocfs2_mount
>>ocfs2_initialize_super
>>  ocfs2_init_global_system_inodes : return -EINVAL if inode is NULL
>>ocfs2_get_system_file_inode
>>  _ocfs2_get_system_file_inode : return NULL if inode is errno
>>ocfs2_iget
>>  ocfs2_read_locked_inode
>>ocfs2_validate_inode_block
>>
>> In this situation we need return -EROFS to upper application, so that
>> user can fix it by fsck. And then mount again.
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> ---
>>   fs/ocfs2/super.c | 10 --
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>> index 040bbb6..dea21a7 100644
>> --- a/fs/ocfs2/super.c
>> +++ b/fs/ocfs2/super.c
>> @@ -474,7 +474,10 @@ static int ocfs2_init_global_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +if (ocfs2_is_soft_readonly(osb))
>> +status = -EROFS;
>> +else
>> +status = -EINVAL;
>>  mlog_errno(status);
>>  /* FIXME: Should ERROR_RO_FS */
>>  mlog(ML_ERROR, "Unable to load system inode %d, "
>> @@ -505,7 +508,10 @@ static int ocfs2_init_local_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +if (ocfs2_is_soft_readonly(osb))
>> +status = -EROFS;
>> +else
>> +status = -EINVAL;
>>  mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
>>   status, i, osb->slot_num);
>>  goto bail;
>>
> 
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: return -EROFS to upper if inode block is invalid

2017-12-25 Thread piaojun
Hi Gang,

As in the call chain below, ocfs2_validate_inode_block() will detect the
invalid inode and return an error to its callers, which in the end makes
ocfs2_get_system_file_inode() fail.
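
A rough sketch (simplified, not the exact helper) of where the original
errno is flattened to NULL on that path, which is why the caller has to
pick an errno on its own afterwards:

	/* _ocfs2_get_system_file_inode(), simplified */
	inode = ocfs2_iget(osb, blkno, OCFS2_BH_IGNORE_CACHE, 0);
	if (IS_ERR(inode)) {	/* e.g. validation failed, fs marked soft read-only */
		mlog_errno(PTR_ERR(inode));
		inode = NULL;	/* the original errno is dropped here */
	}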

thanks,
Jun

On 2017/12/26 10:22, Gang He wrote:
> Hi Piaojun,
> 
> Just one quick question, if the file system is read-only, this can make 
> ocfs2_get_system_file_inode() function invoke failure?
> If ture, I think this code change make sense.
> 
> Thanks
> Gang
> 
> 
> 
>>>>
>> If metadata is corrupted such as 'invalid inode block', we will get
>> failed by calling 'mount()' as below:
>>
>> ocfs2_mount
>>   ocfs2_initialize_super
>> ocfs2_init_global_system_inodes : return -EINVAL if inode is NULL
>>   ocfs2_get_system_file_inode
>> _ocfs2_get_system_file_inode : return NULL if inode is errno
>>   ocfs2_iget
>> ocfs2_read_locked_inode
>>   ocfs2_validate_inode_block
>>
>> In this situation we need return -EROFS to upper application, so that
>> user can fix it by fsck. And then mount again.
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> ---
>>  fs/ocfs2/super.c | 10 --
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>> index 040bbb6..dea21a7 100644
>> --- a/fs/ocfs2/super.c
>> +++ b/fs/ocfs2/super.c
>> @@ -474,7 +474,10 @@ static int ocfs2_init_global_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +if (ocfs2_is_soft_readonly(osb))
>> +status = -EROFS;
>> +else
>> +status = -EINVAL;
>>  mlog_errno(status);
>>  /* FIXME: Should ERROR_RO_FS */
>>  mlog(ML_ERROR, "Unable to load system inode %d, "
>> @@ -505,7 +508,10 @@ static int ocfs2_init_local_system_inodes(struct 
>> ocfs2_super *osb)
>>  new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
>>  if (!new) {
>>  ocfs2_release_system_inodes(osb);
>> -status = -EINVAL;
>> +if (ocfs2_is_soft_readonly(osb))
>> +status = -EROFS;
>> +else
>> +status = -EINVAL;
>>  mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
>>   status, i, osb->slot_num);
>>  goto bail;
>> -- 
>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com 
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: return -EROFS to upper if inode block is invalid

2017-12-25 Thread piaojun
If metadata is corrupted, such as an invalid inode block, mount() will
fail as below:

ocfs2_mount
  ocfs2_initialize_super
ocfs2_init_global_system_inodes : return -EINVAL if inode is NULL
  ocfs2_get_system_file_inode
_ocfs2_get_system_file_inode : return NULL if inode is errno
  ocfs2_iget
ocfs2_read_locked_inode
  ocfs2_validate_inode_block

In this situation we need to return -EROFS to the calling application,
so that the user can fix it with fsck and then mount again.

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
---
 fs/ocfs2/super.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 040bbb6..dea21a7 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -474,7 +474,10 @@ static int ocfs2_init_global_system_inodes(struct 
ocfs2_super *osb)
new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
if (!new) {
ocfs2_release_system_inodes(osb);
-   status = -EINVAL;
+   if (ocfs2_is_soft_readonly(osb))
+   status = -EROFS;
+   else
+   status = -EINVAL;
mlog_errno(status);
/* FIXME: Should ERROR_RO_FS */
mlog(ML_ERROR, "Unable to load system inode %d, "
@@ -505,7 +508,10 @@ static int ocfs2_init_local_system_inodes(struct 
ocfs2_super *osb)
new = ocfs2_get_system_file_inode(osb, i, osb->slot_num);
if (!new) {
ocfs2_release_system_inodes(osb);
-   status = -EINVAL;
+   if (ocfs2_is_soft_readonly(osb))
+   status = -EROFS;
+   else
+   status = -EINVAL;
mlog(ML_ERROR, "status=%d, sysfile=%d, slot=%d\n",
 status, i, osb->slot_num);
goto bail;
-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: fall back to buffer IO when append dio is disabled with file hole existing

2017-12-18 Thread piaojun
Hi Changwei,

On 2017/12/19 11:05, Changwei Ge wrote:
> Hi Jun,
> 
> On 2017/12/19 9:48, piaojun wrote:
>> Hi Changwei,
>>
>> On 2017/12/18 20:06, Changwei Ge wrote:
>>> Before ocfs2 supporting allocating clusters while doing append-dio, all 
>>> append
>>> dio will fall back to buffer io to allocate clusters firstly. Also, when it
>>> steps on a file hole, it will fall back to buffer io, too. But for current
>>> code, writing to file hole will leverage dio to allocate clusters. This is 
>>> not
>>> right, since whether append-io is enabled tells the capability whether 
>>> ocfs2 can
>>> allocate space while doing dio.
>>> So introduce file hole check function back into ocfs2.
>>> Once ocfs2 is doing dio upon a file hole with append-dio disabled, it will 
>>> fall
>>> back to buffer IO to allocate clusters.
>>>
>> 1. Do you mean that filling hole can't go with dio when append-dio is 
>> disabled?
> 
> Yes, direct IO will fall back to buffer IO with _append-dio_ disabled.

Why does dio need to fall back to buffered IO when append-dio is disabled?
Could a 'common' dio on a file hole go through the direct IO path? If not,
could you please explain why the fallback is necessary?

> 
>> 2. Is your checking-hole just for 'append-dio' or for 'all-common-dio'?
> 
> Just for append-dio
> 

If your patch is just for 'append-dio', I wonder whether it will have an
impact on 'common-dio'.

thanks,
Jun

>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>> ---
>>>fs/ocfs2/aops.c | 44 ++--
>>>1 file changed, 42 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>>> index d151632..a982cf6 100644
>>> --- a/fs/ocfs2/aops.c
>>> +++ b/fs/ocfs2/aops.c
>>> @@ -2414,6 +2414,44 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
>>> return ret;
>>>}
>>>
>>> +/*
>>> + * Will look for holes and unwritten extents in the range starting at
>>> + * pos for count bytes (inclusive).
>>> + */
>>> +static int ocfs2_check_range_for_holes(struct inode *inode, loff_t pos,
>>> +  size_t count)
>>> +{
>>> +   int ret = 0;
>>> +   unsigned int extent_flags;
>>> +   u32 cpos, clusters, extent_len, phys_cpos;
>>> +   struct super_block *sb = inode->i_sb;
>>> +
>>> +   cpos = pos >> OCFS2_SB(sb)->s_clustersize_bits;
>>> +   clusters = ocfs2_clusters_for_bytes(sb, pos + count) - cpos;
>>> +
>>> +   while (clusters) {
>>> +   ret = ocfs2_get_clusters(inode, cpos, _cpos, _len,
>>> +_flags);
>>> +   if (ret < 0) {
>>> +   mlog_errno(ret);
>>> +   goto out;
>>> +   }
>>> +
>>> +   if (phys_cpos == 0 || (extent_flags & OCFS2_EXT_UNWRITTEN)) {
>>> +   ret = 1;
>>> +   break;
>>> +   }
>>> +
>>> +   if (extent_len > clusters)
>>> +   extent_len = clusters;
>>> +
>>> +   clusters -= extent_len;
>>> +   cpos += extent_len;
>>> +   }
>>> +out:
>>> +   return ret;
>>> +}
>>> +
>>>static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>{
>>> struct file *file = iocb->ki_filp;
>>> @@ -2429,8 +2467,10 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, 
>>> struct iov_iter *iter)
>>> return 0;
>>>
>>> /* Fallback to buffered I/O if we do not support append dio. */
>>> -   if (iocb->ki_pos + iter->count > i_size_read(inode) &&
>>> -   !ocfs2_supports_append_dio(osb))
>>> +   if (!ocfs2_supports_append_dio(osb) &&
>>> +   (iocb->ki_pos + iter->count > i_size_read(inode) ||
>>> +ocfs2_check_range_for_holes(inode, iocb->ki_pos,
>>> +iter->count)))
>> we should check error here, right?
> 
> Accept this point.
> 
> Thanks,
> Changwei
> 
>>
>> thanks,
>> Jun
>>> return 0;
>>>
>>> if (iov_iter_rw(iter) == READ)
>>>
>>
> 
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: fall back to buffer IO when append dio is disabled with file hole existing

2017-12-18 Thread piaojun
Hi Changwei,

On 2017/12/18 20:06, Changwei Ge wrote:
> Before ocfs2 supporting allocating clusters while doing append-dio, all append
> dio will fall back to buffer io to allocate clusters firstly. Also, when it
> steps on a file hole, it will fall back to buffer io, too. But for current
> code, writing to file hole will leverage dio to allocate clusters. This is not
> right, since whether append-io is enabled tells the capability whether ocfs2 
> can
> allocate space while doing dio.
> So introduce file hole check function back into ocfs2.
> Once ocfs2 is doing dio upon a file hole with append-dio disabled, it will 
> fall
> back to buffer IO to allocate clusters.
> 
1. Do you mean that filling a hole can't go through dio when append-dio is disabled?
2. Is your hole check just for 'append-dio' or for all common dio?
> Signed-off-by: Changwei Ge 
> ---
>   fs/ocfs2/aops.c | 44 ++--
>   1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index d151632..a982cf6 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2414,6 +2414,44 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
>   return ret;
>   }
>   
> +/*
> + * Will look for holes and unwritten extents in the range starting at
> + * pos for count bytes (inclusive).
> + */
> +static int ocfs2_check_range_for_holes(struct inode *inode, loff_t pos,
> +size_t count)
> +{
> + int ret = 0;
> + unsigned int extent_flags;
> + u32 cpos, clusters, extent_len, phys_cpos;
> + struct super_block *sb = inode->i_sb;
> +
> + cpos = pos >> OCFS2_SB(sb)->s_clustersize_bits;
> + clusters = ocfs2_clusters_for_bytes(sb, pos + count) - cpos;
> +
> + while (clusters) {
> + ret = ocfs2_get_clusters(inode, cpos, &phys_cpos, &extent_len,
> +  &extent_flags);
> + if (ret < 0) {
> + mlog_errno(ret);
> + goto out;
> + }
> +
> + if (phys_cpos == 0 || (extent_flags & OCFS2_EXT_UNWRITTEN)) {
> + ret = 1;
> + break;
> + }
> +
> + if (extent_len > clusters)
> + extent_len = clusters;
> +
> + clusters -= extent_len;
> + cpos += extent_len;
> + }
> +out:
> + return ret;
> +}
> +
>   static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>   {
>   struct file *file = iocb->ki_filp;
> @@ -2429,8 +2467,10 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter)
>   return 0;
>   
>   /* Fallback to buffered I/O if we do not support append dio. */
> - if (iocb->ki_pos + iter->count > i_size_read(inode) &&
> - !ocfs2_supports_append_dio(osb))
> + if (!ocfs2_supports_append_dio(osb) &&
> + (iocb->ki_pos + iter->count > i_size_read(inode) ||
> +  ocfs2_check_range_for_holes(inode, iocb->ki_pos,
> +  iter->count)))
we should check the error return here, right? For example:
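(A sketch only, reusing the names from your patch -- ocfs2_direct_IO() would
first capture the return value and bail out on a real error, then treat a
positive return as "hole found":)

	ret = ocfs2_check_range_for_holes(inode, iocb->ki_pos, iter->count);
	if (ret < 0) {
		mlog_errno(ret);
		return ret;
	}

	/* Fallback to buffered I/O if we do not support append dio. */
	if (!ocfs2_supports_append_dio(osb) &&
	    (iocb->ki_pos + iter->count > i_size_read(inode) || ret > 0))
		return 0;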

thanks,
Jun
>   return 0;
>   
>   if (iov_iter_rw(iter) == READ)
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: fix a potential 'ABBA' deadlock caused by 'l_lock' and 'dentry_attach_lock'

2017-12-08 Thread piaojun
Hi Changwei,

On 2017/12/8 17:09, Changwei Ge wrote:
> On 2017/12/7 20:37, piaojun wrote:
>> Hi Changwei,
>>
>> On 2017/12/7 19:59, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2017/12/7 19:30, piaojun wrote:
>>>>  CPUA  CPUB
>>>>
>>>> ocfs2_dentry_convert_worker
>>>> get 'l_lock'
>>>
>>> This lock belongs to ocfs2_dentry_lock::ocfs2_lock_res::l_lock
>>>
>>>>
>>>>   get 'dentry_attach_lock'
>>>>
>>>>   interruptted by dio_end_io:
>>>> dio_end_io
>>>>   dio_bio_end_aio
>>>> dio_complete
>>>>   dio->end_io
>>>> ocfs2_dio_end_io
>>>>   ocfs2_rw_unlock
>>>>   ...
>>>>   try to get 'l_lock'
>>>
>>> This lock belongs to ocfs2_lock_res::l_lock though.
>>>
>>> So I think they are totally two different locks.
>>> So making spin_lock to interruption safe is pointless.
>>>
>>> Thanks,
>>> Changwei
>>>
>>
>> For the same lock, we need use spin_lock_irqsave to ensure irq-safe as
>> you said. But the scenario described above is another kind of deadlock
>> caused by two different locks which stuck each other. To avoid this
>> 'ABBA' deadlock we need follow the rule that 'spin_lock_irqsave' should
>> be called under 'spin_lock_irqsave'.
> 
> Hi Jun,
> I'm not sure if you understood my concern?
> I agree with that we should avoid ABBA dead lock scenario.
> But the dead lock scenario your sequence diagram is demonstrating 
> doesn't even exist.
> 
> Because CPU A is holding lock _X1_ and waiting for _dentry_attach_lock_.
> meanwhile CPU B is holding lock _dentry_attach_lock_ and waiting for _X2_.
> No ABBA deadlock condition is met, CPU B will acquire lock _X2_ without 
> being affected by CPU A.
> 
> In a nutshell, there are three locks(lock memory space) involved in your 
> diagram rather than two.
> 
> Thanks,
> Changwei
> 

Sorry for misunderstanding your point; I realize now that no deadlock would
happen, as these two 'l_lock's belong to different lockres. But I still
suggest merging this patch, because it would reduce the risk of accidentally
introducing a new deadlock in future changes.

thanks,
Jun

>>
>> thanks,
>> Jun
>>
>>>>   but CPUA has got it.
>>>>
>>>> try to get 'dentry_attach_lock',
>>>> but CPUB has got 'dentry_attach_lock',
>>>> and would not release it.
>>>>
>>>> so we need use spin_lock_irqsave for 'dentry_attach_lock' to prevent
>>>> interruptted by softirq.
>>>>
>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>>>> ---
>>>>fs/ocfs2/dcache.c  | 14 --
>>>>fs/ocfs2/dlmglue.c | 14 +++---
>>>>fs/ocfs2/namei.c   |  5 +++--
>>>>3 files changed, 18 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
>>>> index 2903730..6555fbf 100644
>>>> --- a/fs/ocfs2/dcache.c
>>>> +++ b/fs/ocfs2/dcache.c
>>>> @@ -230,6 +230,7 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
>>>>int ret;
>>>>struct dentry *alias;
>>>>struct ocfs2_dentry_lock *dl = dentry->d_fsdata;
>>>> +  unsigned long flags;
>>>>
>>>>trace_ocfs2_dentry_attach_lock(dentry->d_name.len, 
>>>> dentry->d_name.name,
>>>>   (unsigned long 
>>>> long)parent_blkno, dl);
>>>> @@ -309,10 +310,10 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
>>>>ocfs2_dentry_lock_res_init(dl, parent_blkno, inode);
>>>>
>>>>out_attach:
>>>> -  spin_lock(_attach_lock);
>>>> +  spin_lock_irqsave(_attach_lock, flags);
>>>>dentry->d_fsdata = dl;
>>>>dl->dl_count++;
>>&

Re: [Ocfs2-devel] [PATCH] ocfs2: fix a potential 'ABBA' deadlock caused by 'l_lock' and 'dentry_attach_lock'

2017-12-07 Thread piaojun
Hi Changwei,

On 2017/12/7 19:59, Changwei Ge wrote:
> Hi Jun,
> 
> On 2017/12/7 19:30, piaojun wrote:
>> CPUA  CPUB
>>
>> ocfs2_dentry_convert_worker
>> get 'l_lock'
> 
> This lock belongs to ocfs2_dentry_lock::ocfs2_lock_res::l_lock
> 
>>
>>  get 'dentry_attach_lock'
>>
>>  interruptted by dio_end_io:
>>dio_end_io
>>  dio_bio_end_aio
>>dio_complete
>>  dio->end_io
>>ocfs2_dio_end_io
>>  ocfs2_rw_unlock
>>  ...
>>  try to get 'l_lock'
> 
> This lock belongs to ocfs2_lock_res::l_lock though.
> 
> So I think they are totally two different locks.
> So making spin_lock to interruption safe is pointless.
> 
> Thanks,
> Changwei
> 

For the same lock, we need to use spin_lock_irqsave to ensure irq safety, as
you said. But the scenario described above is another kind of deadlock, caused
by two different locks that block each other. To avoid this 'ABBA' deadlock we
need to follow the rule that a lock nested under 'spin_lock_irqsave' should
itself be taken with 'spin_lock_irqsave'.
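To illustrate the rule with the locks from this patch (just a sketch; this is
essentially what the patch does in ocfs2_dentry_convert_worker()):

	unsigned long flags, d_flags;

	spin_lock_irqsave(&lockres->l_lock, flags);
	/* a lock nested inside an irqsave section is taken with the
	 * irqsave variant too, so a softirq cannot run on this CPU
	 * while the inner lock is held */
	spin_lock_irqsave(&dentry_attach_lock, d_flags);
	/* ... critical section ... */
	spin_unlock_irqrestore(&dentry_attach_lock, d_flags);
	spin_unlock_irqrestore(&lockres->l_lock, flags);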

thanks,
Jun

>>  but CPUA has got it.
>>
>> try to get 'dentry_attach_lock',
>> but CPUB has got 'dentry_attach_lock',
>> and would not release it.
>>
>> so we need use spin_lock_irqsave for 'dentry_attach_lock' to prevent
>> interruptted by softirq.
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> ---
>>   fs/ocfs2/dcache.c  | 14 --
>>   fs/ocfs2/dlmglue.c | 14 +++---
>>   fs/ocfs2/namei.c   |  5 +++--
>>   3 files changed, 18 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
>> index 2903730..6555fbf 100644
>> --- a/fs/ocfs2/dcache.c
>> +++ b/fs/ocfs2/dcache.c
>> @@ -230,6 +230,7 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
>>  int ret;
>>  struct dentry *alias;
>>  struct ocfs2_dentry_lock *dl = dentry->d_fsdata;
>> +unsigned long flags;
>>
>>  trace_ocfs2_dentry_attach_lock(dentry->d_name.len, dentry->d_name.name,
>> (unsigned long long)parent_blkno, dl);
>> @@ -309,10 +310,10 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
>>  ocfs2_dentry_lock_res_init(dl, parent_blkno, inode);
>>
>>   out_attach:
>> -spin_lock(_attach_lock);
>> +spin_lock_irqsave(_attach_lock, flags);
>>  dentry->d_fsdata = dl;
>>  dl->dl_count++;
>> -spin_unlock(_attach_lock);
>> +spin_unlock_irqrestore(_attach_lock, flags);
>>
>>  /*
>>   * This actually gets us our PRMODE level lock. From now on,
>> @@ -333,9 +334,9 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
>>  if (ret < 0 && !alias) {
>>  ocfs2_lock_res_free(>dl_lockres);
>>  BUG_ON(dl->dl_count != 1);
>> -spin_lock(_attach_lock);
>> +spin_lock_irqsave(_attach_lock, flags);
>>  dentry->d_fsdata = NULL;
>> -spin_unlock(_attach_lock);
>> +spin_unlock_irqrestore(_attach_lock, flags);
>>  kfree(dl);
>>  iput(inode);
>>  }
>> @@ -379,13 +380,14 @@ void ocfs2_dentry_lock_put(struct ocfs2_super *osb,
>> struct ocfs2_dentry_lock *dl)
>>   {
>>  int unlock = 0;
>> +unsigned long flags;
>>
>>  BUG_ON(dl->dl_count == 0);
>>
>> -spin_lock(_attach_lock);
>> +spin_lock_irqsave(_attach_lock, flags);
>>  dl->dl_count--;
>>  unlock = !dl->dl_count;
>> -spin_unlock(_attach_lock);
>> +spin_unlock_irqrestore(_attach_lock, flags);
>>
>>  if (unlock)
>>  ocfs2_drop_dentry_lock(osb, dl);
>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>> index 4689940..9bff3d2 100644
>> --- a/fs/ocfs2/dlmglue.c
>> +++ b/fs/ocfs2/dlmglue.c
>> @@ -3801,7 +3801,7 @@ static int ocfs2_dentry_convert_worker(struct 
>> ocfs2_lock_res *lockres,
>>  struct ocfs2_dentry_lock *dl = ocfs2_

[Ocfs2-devel] [PATCH] ocfs2: fix a potential 'ABBA' deadlock caused by 'l_lock' and 'dentry_attach_lock'

2017-12-07 Thread piaojun
   CPUA  CPUB

ocfs2_dentry_convert_worker
get 'l_lock'

get 'dentry_attach_lock'

interrupted by dio_end_io:
  dio_end_io
dio_bio_end_aio
  dio_complete
dio->end_io
  ocfs2_dio_end_io
ocfs2_rw_unlock
...
try to get 'l_lock'
but CPUA has got it.

try to get 'dentry_attach_lock',
but CPUB has got 'dentry_attach_lock',
and would not release it.

so we need to use spin_lock_irqsave for 'dentry_attach_lock' to prevent it
from being interrupted by a softirq.

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
---
 fs/ocfs2/dcache.c  | 14 --
 fs/ocfs2/dlmglue.c | 14 +++---
 fs/ocfs2/namei.c   |  5 +++--
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index 2903730..6555fbf 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -230,6 +230,7 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
int ret;
struct dentry *alias;
struct ocfs2_dentry_lock *dl = dentry->d_fsdata;
+   unsigned long flags;

trace_ocfs2_dentry_attach_lock(dentry->d_name.len, dentry->d_name.name,
   (unsigned long long)parent_blkno, dl);
@@ -309,10 +310,10 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
ocfs2_dentry_lock_res_init(dl, parent_blkno, inode);

 out_attach:
-   spin_lock(&dentry_attach_lock);
+   spin_lock_irqsave(&dentry_attach_lock, flags);
dentry->d_fsdata = dl;
dl->dl_count++;
-   spin_unlock(&dentry_attach_lock);
+   spin_unlock_irqrestore(&dentry_attach_lock, flags);

/*
 * This actually gets us our PRMODE level lock. From now on,
@@ -333,9 +334,9 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry,
if (ret < 0 && !alias) {
ocfs2_lock_res_free(&dl->dl_lockres);
BUG_ON(dl->dl_count != 1);
-   spin_lock(&dentry_attach_lock);
+   spin_lock_irqsave(&dentry_attach_lock, flags);
dentry->d_fsdata = NULL;
-   spin_unlock(&dentry_attach_lock);
+   spin_unlock_irqrestore(&dentry_attach_lock, flags);
kfree(dl);
iput(inode);
}
@@ -379,13 +380,14 @@ void ocfs2_dentry_lock_put(struct ocfs2_super *osb,
   struct ocfs2_dentry_lock *dl)
 {
int unlock = 0;
+   unsigned long flags;

BUG_ON(dl->dl_count == 0);

-   spin_lock(&dentry_attach_lock);
+   spin_lock_irqsave(&dentry_attach_lock, flags);
dl->dl_count--;
unlock = !dl->dl_count;
-   spin_unlock(&dentry_attach_lock);
+   spin_unlock_irqrestore(&dentry_attach_lock, flags);

if (unlock)
ocfs2_drop_dentry_lock(osb, dl);
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 4689940..9bff3d2 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3801,7 +3801,7 @@ static int ocfs2_dentry_convert_worker(struct 
ocfs2_lock_res *lockres,
struct ocfs2_dentry_lock *dl = ocfs2_lock_res_dl(lockres);
struct ocfs2_inode_info *oi = OCFS2_I(dl->dl_inode);
struct dentry *dentry;
-   unsigned long flags;
+   unsigned long flags, d_flags;
int extra_ref = 0;

/*
@@ -3831,13 +3831,13 @@ static int ocfs2_dentry_convert_worker(struct 
ocfs2_lock_res *lockres,
 * flag.
 */
spin_lock_irqsave(&lockres->l_lock, flags);
-   spin_lock(&dentry_attach_lock);
+   spin_lock_irqsave(&dentry_attach_lock, d_flags);
if (!(lockres->l_flags & OCFS2_LOCK_FREEING)
&& dl->dl_count) {
dl->dl_count++;
extra_ref = 1;
}
-   spin_unlock(&dentry_attach_lock);
+   spin_unlock_irqrestore(&dentry_attach_lock, d_flags);
spin_unlock_irqrestore(&lockres->l_lock, flags);

mlog(0, "extra_ref = %d\n", extra_ref);
@@ -3850,13 +3850,13 @@ static int ocfs2_dentry_convert_worker(struct 
ocfs2_lock_res *lockres,
if (!extra_ref)
return UNBLOCK_CONTINUE;

-   spin_lock(&dentry_attach_lock);
+   spin_lock_irqsave(&dentry_attach_lock, d_flags);
while (1) {
dentry = ocfs2_find_local_alias(dl->dl_inode,
dl->dl_parent_blkno, 1);
if (!dentry)
break;
-   spin_unlock(&dentry_attach_lock);
+   spin_unlock_irqrestore(&dentry_attach_lock, d_flags);

if (S_ISDIR(dl->dl_inode->i_mode))
shrink_dcache_parent(dentry);
@@ -3874,9 +3874,9 @@ static int 

Re: [Ocfs2-devel] [PATCH v2 1/3] ocfs2: add ocfs2_try_rw_lock and ocfs2_try_inode_lock

2017-11-29 Thread piaojun


On 2017/11/29 16:36, Gang He wrote:
> Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which
> will be used in non-block IO scenarios.
> 
> Signed-off-by: Gang He 
Reviewed-by: Jun Piao 
> ---
>  fs/ocfs2/dlmglue.c | 21 +
>  fs/ocfs2/dlmglue.h |  4 
>  2 files changed, 25 insertions(+)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 4689940..a68efa3 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -1742,6 +1742,27 @@ int ocfs2_rw_lock(struct inode *inode, int write)
>   return status;
>  }
>  
> +int ocfs2_try_rw_lock(struct inode *inode, int write)
> +{
> + int status, level;
> + struct ocfs2_lock_res *lockres;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +
> + mlog(0, "inode %llu try to take %s RW lock\n",
> +  (unsigned long long)OCFS2_I(inode)->ip_blkno,
> +  write ? "EXMODE" : "PRMODE");
> +
> + if (ocfs2_mount_local(osb))
> + return 0;
> +
> + lockres = &OCFS2_I(inode)->ip_rw_lockres;
> +
> + level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
> +
> + status = ocfs2_cluster_lock(osb, lockres, level, DLM_LKF_NOQUEUE, 0);
> + return status;
> +}
> +
>  void ocfs2_rw_unlock(struct inode *inode, int write)
>  {
>   int level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
> diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
> index a7fc18b..05910fc 100644
> --- a/fs/ocfs2/dlmglue.h
> +++ b/fs/ocfs2/dlmglue.h
> @@ -116,6 +116,7 @@ void ocfs2_refcount_lock_res_init(struct ocfs2_lock_res 
> *lockres,
>  int ocfs2_create_new_inode_locks(struct inode *inode);
>  int ocfs2_drop_inode_locks(struct inode *inode);
>  int ocfs2_rw_lock(struct inode *inode, int write);
> +int ocfs2_try_rw_lock(struct inode *inode, int write);
>  void ocfs2_rw_unlock(struct inode *inode, int write);
>  int ocfs2_open_lock(struct inode *inode);
>  int ocfs2_try_open_lock(struct inode *inode, int write);
> @@ -140,6 +141,9 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>  /* 99% of the time we don't want to supply any additional flags --
>   * those are for very specific cases only. */
>  #define ocfs2_inode_lock(i, b, e) ocfs2_inode_lock_full_nested(i, b, e, 0, 
> OI_LS_NORMAL)
> +#define ocfs2_try_inode_lock(i, b, e)\
> + ocfs2_inode_lock_full_nested(i, b, e, OCFS2_META_LOCK_NOQUEUE,\
> + OI_LS_NORMAL)
>  void ocfs2_inode_unlock(struct inode *inode,
>  int ex);
>  int ocfs2_super_lock(struct ocfs2_super *osb,
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: add ocfs2_overwrite_io function

2017-11-27 Thread piaojun
Hi Gang,

If ocfs2_overwrite_io is only called in 'nowait' scenarios, I wonder if
we can discard the 'int wait' argument, just as ext4 does:

static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len);
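A bool-returning, nowait-only variant could then be used by the caller
roughly like this (a sketch only; the IOCB_NOWAIT handling and the caller
variable names are my assumptions, not something from your patch):

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/* not a pure overwrite: allocation could block,
		 * so ask the caller to retry in a blocking context */
		if (!ocfs2_overwrite_io(inode, iocb->ki_pos,
					iov_iter_count(from)))
			return -EAGAIN;
	}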

thanks,
Jun

On 2017/11/27 17:46, Gang He wrote:
> Add ocfs2_overwrite_io function, which is used to judge if
> overwrite allocated blocks, otherwise, the write will bring extra
> block allocation overhead.
> 
> Signed-off-by: Gang He 
> ---
>  fs/ocfs2/extent_map.c | 67 
> +++
>  fs/ocfs2/extent_map.h |  3 +++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
> index e4719e0..98bf325 100644
> --- a/fs/ocfs2/extent_map.c
> +++ b/fs/ocfs2/extent_map.c
> @@ -832,6 +832,73 @@ int ocfs2_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fieinfo,
>   return ret;
>  }
>  
> +/* Is IO overwriting allocated blocks? */
> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
> +int wait)
> +{
> + int ret = 0, is_last;
> + u32 mapping_end, cpos;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + struct buffer_head *di_bh = NULL;
> + struct ocfs2_extent_rec rec;
> +
> + if (wait)
> + ret = ocfs2_inode_lock(inode, &di_bh, 0);
> + else
> + ret = ocfs2_try_inode_lock(inode, &di_bh, 0);
> + if (ret)
> + goto out;
> +
> + if (wait)
> + down_read(&OCFS2_I(inode)->ip_alloc_sem);
> + else {
> + if (!down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem)) {
> + ret = -EAGAIN;
> + goto out_unlock1;
> + }
> + }
> +
> + if ((OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
> +((map_start + map_len) <= i_size_read(inode)))
> + goto out_unlock2;
> +
> + cpos = map_start >> osb->s_clustersize_bits;
> + mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
> +map_start + map_len);
> + is_last = 0;
> + while (cpos < mapping_end && !is_last) {
> + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
> +  NULL, &rec, &is_last);
> + if (ret) {
> + mlog_errno(ret);
> + goto out_unlock2;
> + }
> +
> + if (rec.e_blkno == 0ULL)
> + break;
> +
> + if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
> + break;
> +
> + cpos = le32_to_cpu(rec.e_cpos) +
> + le16_to_cpu(rec.e_leaf_clusters);
> + }
> +
> + if (cpos < mapping_end)
> + ret = 1;
> +
> +out_unlock2:
> + brelse(di_bh);
> +
> + up_read(&OCFS2_I(inode)->ip_alloc_sem);
> +
> +out_unlock1:
> + ocfs2_inode_unlock(inode, 0);
> +
> +out:
> + return (ret ? 0 : 1);
> +}
> +
>  int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int 
> whence)
>  {
>   struct inode *inode = file->f_mapping->host;
> diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
> index 67ea57d..fd9e86a 100644
> --- a/fs/ocfs2/extent_map.h
> +++ b/fs/ocfs2/extent_map.h
> @@ -53,6 +53,9 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 
> v_blkno, u64 *p_blkno,
>  int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>u64 map_start, u64 map_len);
>  
> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
> +int wait);
> +
>  int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int 
> origin);
>  
>  int ocfs2_xattr_get_clusters(struct inode *inode, u32 v_cluster,
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH 1/3] ocfs2: add ocfs2_try_rw_lock and ocfs2_try_inode_lock

2017-11-27 Thread piaojun
Hi Gang,

On 2017/11/27 17:46, Gang He wrote:
> Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which
> will be used in non-block IO scenarios.
> 
> Signed-off-by: Gang He 
> ---
>  fs/ocfs2/dlmglue.c | 22 ++
>  fs/ocfs2/dlmglue.h |  4 
>  2 files changed, 26 insertions(+)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 4689940..5cfbd04 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -1742,6 +1742,28 @@ int ocfs2_rw_lock(struct inode *inode, int write)
>   return status;
>  }
>  
> +int ocfs2_try_rw_lock(struct inode *inode, int write)
> +{
> + int status, level;
> + struct ocfs2_lock_res *lockres;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +
> + mlog(0, "inode %llu try to take %s RW lock\n",
> +  (unsigned long long)OCFS2_I(inode)->ip_blkno,
> +  write ? "EXMODE" : "PRMODE");
> +
> + if (ocfs2_mount_local(osb))
> + return 0;
> +
> + lockres = &OCFS2_I(inode)->ip_rw_lockres;
> +
> + level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
> +
> + status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres, level,
> + DLM_LKF_NOQUEUE, 0);

we'd better use 'osb' instead of 'OCFS2_SB(inode->i_sb)' here, e.g.:
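i.e. simply (this is in fact what the later v2 posting of this patch does):

	status = ocfs2_cluster_lock(osb, lockres, level, DLM_LKF_NOQUEUE, 0);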

> + return status;
> +}
> +
>  void ocfs2_rw_unlock(struct inode *inode, int write)
>  {
>   int level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
> diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
> index a7fc18b..05910fc 100644
> --- a/fs/ocfs2/dlmglue.h
> +++ b/fs/ocfs2/dlmglue.h
> @@ -116,6 +116,7 @@ void ocfs2_refcount_lock_res_init(struct ocfs2_lock_res 
> *lockres,
>  int ocfs2_create_new_inode_locks(struct inode *inode);
>  int ocfs2_drop_inode_locks(struct inode *inode);
>  int ocfs2_rw_lock(struct inode *inode, int write);
> +int ocfs2_try_rw_lock(struct inode *inode, int write);
>  void ocfs2_rw_unlock(struct inode *inode, int write);
>  int ocfs2_open_lock(struct inode *inode);
>  int ocfs2_try_open_lock(struct inode *inode, int write);
> @@ -140,6 +141,9 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>  /* 99% of the time we don't want to supply any additional flags --
>   * those are for very specific cases only. */
>  #define ocfs2_inode_lock(i, b, e) ocfs2_inode_lock_full_nested(i, b, e, 0, 
> OI_LS_NORMAL)
> +#define ocfs2_try_inode_lock(i, b, e)\
> + ocfs2_inode_lock_full_nested(i, b, e, OCFS2_META_LOCK_NOQUEUE,\
> + OI_LS_NORMAL)
>  void ocfs2_inode_unlock(struct inode *inode,
>  int ex);
>  int ocfs2_super_lock(struct ocfs2_super *osb,
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!

2017-11-20 Thread piaojun
Hi John,

On 2017/11/21 8:58, Changwei Ge wrote:
> Hi John,
> It's better to paste your patch directly into message body. It's easy 
> for reviewing.
> 
> So I copied your patch below:
> 
>> The dw_zero_count tracking was assuming that w_unwritten_list would
>> always contain one element. The actual count is now tracked whenever
>> the list is extended.
>> ---
>>  fs/ocfs2/aops.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index 88a31e9340a0..eb0a81368dbb 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
>> @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt {
>>  struct ocfs2_cached_dealloc_ctxt w_dealloc;
>>  
>>  struct list_headw_unwritten_list;
>> +
>> +unsigned intw_unwritten_count;
>>  };
>>  
>>  void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages)
>> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct 
>> ocfs2_write_ctxt **wcp,
>>  
>>  ocfs2_init_dealloc_ctxt(>w_dealloc);
>>  INIT_LIST_HEAD(>w_unwritten_list);
>> +wc->w_unwritten_count = 0;
> 
> I think you don't have to initialize ::w_unwritten_count to zero since
> kzalloc already did that.
> 
>>  
>>  *wcp = wc;
>>  
>> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode,
>>  desc->c_clear_unwritten = 0;
>>  list_add_tail(>ue_ip_node, >ip_unwritten_list);
>>  list_add_tail(>ue_node, >w_unwritten_list);
>> +wc->w_unwritten_count++;
> 
> You increase ::w_unwritten_count once a new _ue_ is attached to
> ::w_unwritten_list. So if no _ue_ is ever attached, ::w_unwritten_list
> is still empty. I think your change has the same effect as the original.
> 
> Moreover I don't see the relation between the reported crash issue and 
> your patch change. Can you elaborate further?
> 
> Thanks,
> Changwei
> 
>>  new = NULL;
>>  unlock:
>>  spin_unlock(>ip_lock);
>> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, 
>> sector_t iblock,
>>  ue->ue_phys = desc->c_phys;
>>  
>>  list_splice_tail_init(&wc->w_unwritten_list,
>> &dwc->dw_zero_list);
>> -dwc->dw_zero_count++;
>> +dwc->dw_zero_count += wc->w_unwritten_count;

I prefer using a loop to calculate 'dwc->dw_zero_count' rather than
introducing a new variable, as below:

list_for_each(iter, &wc->w_unwritten_list)
	dwc->dw_zero_count++;
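One note on the loop form (a sketch, not tested): list_splice_tail_init()
re-initialises the source list, so the counting has to happen before the
splice, e.g.:

	struct list_head *iter;

	list_for_each(iter, &wc->w_unwritten_list)
		dwc->dw_zero_count++;
	list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);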

thanks,
Jun

>>  }
>>  
>>  ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc);
>> -- 
>> 2.11.0
> 
> 
> 
> On 2017/11/21 2:56, John Lightsey wrote:
>> In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel
>> list:
>>
>> https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html
>>
>> cPanel encountered this bug after upgrading our cluster to the 4.9
>> Debian stable kernel. In our environment, the bug would trigger every
>> few hours.
>>
>> The core problem seems to be that the size of dw_zero_list is not
>> tracked correctly. This causes the ocfs2_lock_allocators() call in
>> ocfs2_dio_end_io_write() to underestimate the number of extents needed.
>> As a result, meta_ac is null when it's needed in ocfs2_grow_tree().
>>
>> The attached patch is a forward-ported version of the fix we applied to
>> Debian's 4.9 kernel to correct the issue.
>>
> 
> 
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH v3] ocfs2/dlm: wait for dlm recovery done when migrating all lockres

2017-11-05 Thread piaojun
Wait for dlm recovery to be done when migrating all lockres, in case a new
lockres is left behind after leaving the dlm domain.

  NodeA   NodeBNodeC

umount and migrate
all lockres

 node down

do recovery for NodeB
and collect a new lockres
form other live nodes

leave domain but the
new lockres remains

  mount and join domain

  request for the owner
  of the new lockres, but
  all the other nodes said
  'NO', so NodeC decides to be
  the owner, and sends an
  assert msg to other nodes.

  other nodes receive the msg
  and find two masters exist,
  which at last causes BUG in
  dlm_assert_master_handler()
  -->BUG();

Fixes: bc9838c4d44a ("dlm: allow dlm do recovery during shutdown")

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
Reviewed-by: Yiwen Jiang 
---
 fs/ocfs2/dlm/dlmcommon.h   |  1 +
 fs/ocfs2/dlm/dlmdomain.c   | 14 ++
 fs/ocfs2/dlm/dlmrecovery.c | 13 ++---
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index e9f3705..999ab7d 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -140,6 +140,7 @@ struct dlm_ctxt
u8 node_num;
u32 key;
u8  joining_node;
+   u8 migrate_done; /* set to 1 means node has migrated all lockres */
wait_queue_head_t dlm_join_events;
unsigned long live_nodes_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
unsigned long domain_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index e1fea14..98a8f56 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -461,6 +461,18 @@ static int dlm_migrate_all_locks(struct dlm_ctxt *dlm)
cond_resched_lock(&dlm->spinlock);
num += n;
}
+
+   if (!num) {
+   if (dlm->reco.state & DLM_RECO_STATE_ACTIVE) {
+   mlog(0, "%s: perhaps there are more lock resources need 
to "
+   "be migrated after dlm recovery\n", 
dlm->name);
+   ret = -EAGAIN;
+   } else {
+   mlog(0, "%s: we won't do dlm recovery after migrating 
all lockres",
+   dlm->name);
+   dlm->migrate_done = 1;
+   }
+   }
spin_unlock(&dlm->spinlock);
wake_up(&dlm->dlm_thread_wq);

@@ -2052,6 +2064,8 @@ static struct dlm_ctxt *dlm_alloc_ctxt(const char *domain,
dlm->joining_node = DLM_LOCK_RES_OWNER_UNKNOWN;
init_waitqueue_head(&dlm->dlm_join_events);

+   dlm->migrate_done = 0;
+
dlm->reco.new_master = O2NM_INVALID_NODE_NUM;
dlm->reco.dead_node = O2NM_INVALID_NODE_NUM;

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 74407c6..c4cf682 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -423,12 +423,11 @@ void dlm_wait_for_recovery(struct dlm_ctxt *dlm)

 static void dlm_begin_recovery(struct dlm_ctxt *dlm)
 {
-   spin_lock(&dlm->spinlock);
+   assert_spin_locked(&dlm->spinlock);
BUG_ON(dlm->reco.state & DLM_RECO_STATE_ACTIVE);
printk(KERN_NOTICE "o2dlm: Begin recovery on domain %s for node %u\n",
   dlm->name, dlm->reco.dead_node);
dlm->reco.state |= DLM_RECO_STATE_ACTIVE;
-   spin_unlock(&dlm->spinlock);
 }

 static void dlm_end_recovery(struct dlm_ctxt *dlm)
@@ -456,6 +455,13 @@ static int dlm_do_recovery(struct dlm_ctxt *dlm)

spin_lock(&dlm->spinlock);

+   if (dlm->migrate_done) {
+   mlog(0, "%s: no need do recovery after migrating all lockres\n",
+   dlm->name);
+   spin_unlock(&dlm->spinlock);
+   return 0;
+   }
+
/* check to see if the new master has died */
if (dlm->reco.new_master != O2NM_INVALID_NODE_NUM &&
test_bit(dlm->reco.new_master, dlm->recovery_map)) {
@@ -490,12 +496,13 @@ static int dlm_do_recovery(struct dlm_ctxt *dlm)
mlog(0, "%s(%d):recovery thread found node %u in the recovery map!\n",
 dlm->name, task_pid_nr(dlm->dlm_reco_thread_task),
 dlm->reco.dead_node);
-   spin_unlock(&dlm->spinlock);

/* take write barrier */
/* (stops the list reshuffling thread, proxy ast handling) */
   

[Ocfs2-devel] [PATCH v2] ocfs2/dlm: wait for dlm recovery done when migrating all lockres

2017-11-02 Thread piaojun
Wait for dlm recovery to be done when migrating all lockres, in case a new
lockres is left behind after leaving the dlm domain.

  NodeA   NodeBNodeC

umount and migrate
all lockres

 node down

do recovery for NodeB
and collect a new lockres
form other live nodes

leave domain but the
new lockres remains

  mount and join domain

  request for the owner
  of the new lockres, but
  all the other nodes said
  'NO', so NodeC decide to
  the owner, and send do
  assert msg to other nodes.

  other nodes receive the msg
  and found two masters exist.
  at last cause BUG in
  dlm_assert_master_handler()
  -->BUG();

Fixes: bc9838c4d44a ("dlm: allow dlm do recovery during shutdown")

Signed-off-by: Jun Piao 
Reviewed-by: Alex Chen 
Reviewed-by: Yiwen Jiang 
Acked-by: Changwei Ge 
---
 fs/ocfs2/dlm/dlmcommon.h   |  1 +
 fs/ocfs2/dlm/dlmdomain.c   | 14 ++
 fs/ocfs2/dlm/dlmrecovery.c | 13 ++---
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index e9f3705..999ab7d 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -140,6 +140,7 @@ struct dlm_ctxt
u8 node_num;
u32 key;
u8  joining_node;
+   u8 migrate_done; /* set to 1 means node has migrated all lockres */
wait_queue_head_t dlm_join_events;
unsigned long live_nodes_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
unsigned long domain_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index e1fea14..98a8f56 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -461,6 +461,18 @@ static int dlm_migrate_all_locks(struct dlm_ctxt *dlm)
cond_resched_lock(>spinlock);
num += n;
}
+
+   if (!num) {
+   if (dlm->reco.state & DLM_RECO_STATE_ACTIVE) {
+   mlog(0, "%s: perhaps there are more lock resources need 
to "
+   "be migrated after dlm recovery\n", 
dlm->name);
+   ret = -EAGAIN;
+   } else {
+   mlog(0, "%s: we won't do dlm recovery after migrating 
all lockres",
+   dlm->name);
+   dlm->migrate_done = 1;
+   }
+   }
spin_unlock(>spinlock);
wake_up(>dlm_thread_wq);

@@ -2052,6 +2064,8 @@ static struct dlm_ctxt *dlm_alloc_ctxt(const char *domain,
dlm->joining_node = DLM_LOCK_RES_OWNER_UNKNOWN;
init_waitqueue_head(>dlm_join_events);

+   dlm->migrate_done = 0;
+
dlm->reco.new_master = O2NM_INVALID_NODE_NUM;
dlm->reco.dead_node = O2NM_INVALID_NODE_NUM;

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 74407c6..c4cf682 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -423,12 +423,11 @@ void dlm_wait_for_recovery(struct dlm_ctxt *dlm)

 static void dlm_begin_recovery(struct dlm_ctxt *dlm)
 {
-   spin_lock(>spinlock);
+   assert_spin_locked(>spinlock);
BUG_ON(dlm->reco.state & DLM_RECO_STATE_ACTIVE);
printk(KERN_NOTICE "o2dlm: Begin recovery on domain %s for node %u\n",
   dlm->name, dlm->reco.dead_node);
dlm->reco.state |= DLM_RECO_STATE_ACTIVE;
-   spin_unlock(>spinlock);
 }

 static void dlm_end_recovery(struct dlm_ctxt *dlm)
@@ -456,6 +455,13 @@ static int dlm_do_recovery(struct dlm_ctxt *dlm)

spin_lock(>spinlock);

+   if (dlm->migrate_done) {
+   mlog(0, "%s: no need do recovery after migrating all lockres\n",
+   dlm->name);
+   spin_unlock(>spinlock);
+   return 0;
+   }
+
/* check to see if the new master has died */
if (dlm->reco.new_master != O2NM_INVALID_NODE_NUM &&
test_bit(dlm->reco.new_master, dlm->recovery_map)) {
@@ -490,12 +496,13 @@ static int dlm_do_recovery(struct dlm_ctxt *dlm)
mlog(0, "%s(%d):recovery thread found node %u in the recovery map!\n",
 dlm->name, task_pid_nr(dlm->dlm_reco_thread_task),
 dlm->reco.dead_node);
-   spin_unlock(>spinlock);

/* take write barrier */
/* (stops the list 

Re: [Ocfs2-devel] question about ocfs2-devel

2017-11-02 Thread piaojun
Hi Gang,

thanks for replying, and perhaps there is something wrong just now.
I will send patch again.

thanks,
Jun

On 2017/11/3 10:35, Gang He wrote:
> We did get this mail.
> 
> Thanks
> Gang
> 
> 

>> Hi,
>>
>> I failed to send mail to ocfs2-devel@oss.oracle.com, who kowns
>> the reason.
>>
>> thanks,
>> Jun
>>
>> To: piao...@huawei.com 
>>
>> This message is from the Forcepoint email protection system, TRITON AP-EMAIL 
>> at host huawei.com.
>>
>> The attached email could not be delivered to one or more recipients.
>>
>> For assistance, please contact your email system administrator and include 
>> the following problem report:
>>
>> : host userp2030.oracle.com[156.151.31.89] said:
>> 550 5.7.1 Relaying denied (in reply to RCPT TO command)
>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com 
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] question about ocfs2-devel

2017-11-02 Thread piaojun
Hi,

I failed to send mail to ocfs2-devel@oss.oracle.com; does anyone know
the reason?

thanks,
Jun

To: piao...@huawei.com

This message is from the Forcepoint email protection system, TRITON AP-EMAIL at 
host huawei.com.

The attached email could not be delivered to one or more recipients.

For assistance, please contact your email system administrator and include the 
following problem report:

: host userp2030.oracle.com[156.151.31.89] said:
550 5.7.1 Relaying denied (in reply to RCPT TO command)

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lockres

2017-11-02 Thread piaojun
Hi Changwei,

please see what my last mail said:

"
if res is marked as DLM RECOVERING, migrating process will wait for
recoverying done. and DLM RECOVERING will be cleared after recoverying.
"

and if there are still lockres left, 'migrate_done' won't be set. Moreover,
if another node dies after migrating, I just let another node do the
recovery.
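For reference, the caller keeps retrying until migration reports success,
roughly like this (simplified from dlm_unregister_domain(); not the exact
upstream code):

	while (dlm_migrate_all_locks(dlm)) {
		/* give dlm_thread time to purge the remaining lockres */
		msleep(500);
		mlog(0, "%s: more migration to do\n", dlm->name);
	}

so returning -EAGAIN when recovery is still active just sends the node around
this loop again instead of letting it leave the domain too early.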

thanks,
Jun

On 2017/11/2 9:56, Changwei Ge wrote:
> On 2017/11/2 9:44, piaojun wrote:
>> Hi Changwei,
>>
>> I had tried a solution like yours before, but failed to prevent the
>> race just by 'dlm_state' and the existed variable as
>> 'DLM_CTXT_IN_SHUTDOWN' contains many status. so I think we need
>> introduce a 'migrate_done' to solve that problem.
> Hi Jun,
> 
> Yes, adding a new flag might be a direction, but I still think we need 
> to clear other nodes' lock resources' flag - DLM_LOCK_RES_RECOVERING, 
> which depends on NodeA's dlm recovering progress. Unfortunately, it is 
> interrupted by the newly added flag ::migrate_done in your patch. :-(
> 
> So no DLM_FINALIZE_RECO_MSG message will be sent out to other nodes, 
> thus DLM_LOCK_RES_RECOVERING can't be cleared.
> 
> As I know, if DLM_LOCK_RES_RECOVERING is set, all lock and unlock 
> requests will be *hang*.
> 
> Thanks,
> Changwei
> 
>>
>> thanks,
>> Jun
>>
>> On 2017/11/1 17:00, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2017/11/1 16:46, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> I do think we need follow the principle that use 'dlm_domain_lock' to
>>>> protect 'dlm_state' as the NOTE says in 'dlm_ctxt':
>>>> /* NOTE: Next three are protected by dlm_domain_lock */
>>>>
>>>> deadnode won't be cleared from 'dlm->domain_map' if return from
>>>> __dlm_hb_node_down(), and NodeA will retry migrating to NodeB forever
>>>> if only NodeA and NodeB in domain. I suggest more testing needed in
>>>> your solution.
>>>
>>> I agree, however, my patch is just a draft to indicate my comments.
>>>
>>> Perhaps we can figure out a better way to solve this, as your patch
>>> can't clear DLM RECOVERING flag on lock resource. I am not sure if it is
>>> reasonable, I suppose this may violate ocfs2/dlm design philosophy.
>>>
>>> Thanks,
>>> Changwei
>>>
>>
>> if a res is marked as DLM RECOVERING, the migrating process will wait for
>> recovery to be done, and DLM RECOVERING will be cleared after recovery.
>>
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>> On 2017/11/1 16:11, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> Thanks for reviewing.
>>>>> I don't think we have to worry about misusing *dlm_domain_lock* and
>>>>> *dlm::spinlock*. I admit my change may look a little different from most
>>>>> of other code snippets where using these two spin locks. But our purpose
>>>>> is to close the race between __dlm_hb_node_down and
>>>>> dlm_unregister_domain, right?  And my change meets that. :-)
>>>>>
>>>>> I suppose we can do it in a flexible way.
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>>>
>>>>> On 2017/11/1 15:57, piaojun wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> thanks for reviewing, and I think waiting for recoverying done before
>>>>>> migrating seems another solution, but I wonder if new problems will be
>>>>>> invoked as following comments.
>>>>>>
>>>>>> thanks,
>>>>>> Jun
>>>>>>
>>>>>> On 2017/11/1 15:13, Changwei Ge wrote:
>>>>>>> Hi Jun,
>>>>>>>
>>>>>>> I probably get your point.
>>>>>>>
>>>>>>> You mean that dlm finds no lock resource to be migrated and no more lock
>>>>>>> resource is managed by its hash table. After that a node dies all of a
>>>>>>> sudden and the dead node is put into dlm's recovery map, right?
>>>>>> that is it.
>>>>>>> Furthermore, a lock resource is migrated from other nodes and local node
>>>>>>> has already asserted master to them.
>>>>>>>
>>>>>>> If so, I want to suggest a easier way to solve it.
>>>>>>> We don't have to add a new flag to dlm structure, just leverage existed
>>&

Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lockres

2017-11-01 Thread piaojun
Hi Changwei,

I had tried a solution like yours before, but failed to prevent the race
using only 'dlm_state' and the existing variables, as 'DLM_CTXT_IN_SHUTDOWN'
covers many states. So I think we need to introduce 'migrate_done' to solve
that problem.

thanks,
Jun

On 2017/11/1 17:00, Changwei Ge wrote:
> Hi Jun,
> 
> On 2017/11/1 16:46, piaojun wrote:
>> Hi Changwei,
>>
>> I do think we need follow the principle that use 'dlm_domain_lock' to
>> protect 'dlm_state' as the NOTE says in 'dlm_ctxt':
>> /* NOTE: Next three are protected by dlm_domain_lock */
>>
>> deadnode won't be cleared from 'dlm->domain_map' if return from
>> __dlm_hb_node_down(), and NodeA will retry migrating to NodeB forever
>> if only NodeA and NodeB in domain. I suggest more testing needed in
>> your solution.
> 
> I agree, however, my patch is just a draft to indicate my comments.
> 
> Perhaps we can figure out a better way to solve this, as your patch 
> can't clear DLM RECOVERING flag on lock resource. I am not sure if it is 
> reasonable, I suppose this may violate ocfs2/dlm design philosophy.
> 
> Thanks,
> Changwei
> 

if a res is marked as DLM RECOVERING, the migrating process will wait for
recovery to be done, and DLM RECOVERING will be cleared after recovery.

>>
>> thanks,
>> Jun
>>
>> On 2017/11/1 16:11, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> Thanks for reviewing.
>>> I don't think we have to worry about misusing *dlm_domain_lock* and
>>> *dlm::spinlock*. I admit my change may look a little different from most
>>> of other code snippets where using these two spin locks. But our purpose
>>> is to close the race between __dlm_hb_node_down and
>>> dlm_unregister_domain, right?  And my change meets that. :-)
>>>
>>> I suppose we can do it in a flexible way.
>>>
>>> Thanks,
>>> Changwei
>>>
>>>
>>> On 2017/11/1 15:57, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> thanks for reviewing, and I think waiting for recoverying done before
>>>> migrating seems another solution, but I wonder if new problems will be
>>>> invoked as following comments.
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>> On 2017/11/1 15:13, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> I probably get your point.
>>>>>
>>>>> You mean that dlm finds no lock resource to be migrated and no more lock
>>>>> resource is managed by its hash table. After that a node dies all of a
>>>>> sudden and the dead node is put into dlm's recovery map, right?
>>>> that is it.
>>>>> Furthermore, a lock resource is migrated from other nodes and local node
>>>>> has already asserted master to them.
>>>>>
>>>>> If so, I want to suggest a easier way to solve it.
>>>>> We don't have to add a new flag to dlm structure, just leverage existed
>>>>> dlm status and bitmap.
>>>>> It will bring a bonus - no lock resource will be marked with RECOVERING,
>>>>> it's a safer way, I suppose.
>>>>>
>>>>> Please take a review.
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>>>
>>>>> Subject: [PATCH] ocfs2/dlm: a node can't be involved in recovery if it
>>>>> is being shutdown
>>>>>
>>>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>>>> ---
>>>>> fs/ocfs2/dlm/dlmdomain.c   | 4 
>>>>> fs/ocfs2/dlm/dlmrecovery.c | 3 +++
>>>>> 2 files changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>>>>> index a2b19fbdcf46..5e9283e509a4 100644
>>>>> --- a/fs/ocfs2/dlm/dlmdomain.c
>>>>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>>>>> @@ -707,11 +707,15 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
>>>>>* want new domain joins to communicate with us at
>>>>>* least until we've completed migration of our
>>>>>* resources. */
>>>>> + spin_lock(>spinlock);
>>>>>   dlm->dlm_state = DLM_CTXT_IN_SHUTDOWN;
>>>>> + spin_unlock(>spinlock);
>>>> I guess there will be misuse of 'dlm->spinlock' and dlm_domain_lock.
&g

Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lockres

2017-11-01 Thread piaojun
Hi Changwei,

I do think we need to follow the principle of using 'dlm_domain_lock' to
protect 'dlm_state', as the NOTE in 'dlm_ctxt' says:
/* NOTE: Next three are protected by dlm_domain_lock */

The dead node won't be cleared from 'dlm->domain_map' if we return early from
__dlm_hb_node_down(), and NodeA will retry migrating to NodeB forever if only
NodeA and NodeB are in the domain. I suggest more testing of your solution.
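For example, a reader of ::dlm_state would be expected to look like this
(a sketch only; dlm_in_shutdown() is a hypothetical helper name, not an
existing function):

	static int dlm_in_shutdown(struct dlm_ctxt *dlm)
	{
		int ret;

		spin_lock(&dlm_domain_lock);
		ret = (dlm->dlm_state == DLM_CTXT_IN_SHUTDOWN);
		spin_unlock(&dlm_domain_lock);

		return ret;
	}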

thanks,
Jun

On 2017/11/1 16:11, Changwei Ge wrote:
> Hi Jun,
> 
> Thanks for reviewing.
> I don't think we have to worry about misusing *dlm_domain_lock* and 
> *dlm::spinlock*. I admit my change may look a little different from most 
> of other code snippets where using these two spin locks. But our purpose 
> is to close the race between __dlm_hb_node_down and 
> dlm_unregister_domain, right?  And my change meets that. :-)
> 
> I suppose we can do it in a flexible way.
> 
> Thanks,
> Changwei
> 
> 
> On 2017/11/1 15:57, piaojun wrote:
>> Hi Changwei,
>>
>> thanks for reviewing, and I think waiting for recoverying done before
>> migrating seems another solution, but I wonder if new problems will be
>> invoked as following comments.
>>
>> thanks,
>> Jun
>>
>> On 2017/11/1 15:13, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> I probably get your point.
>>>
>>> You mean that dlm finds no lock resource to be migrated and no more lock
>>> resource is managed by its hash table. After that a node dies all of a
>>> sudden and the dead node is put into dlm's recovery map, right?
>> that is it.
>>> Furthermore, a lock resource is migrated from other nodes and local node
>>> has already asserted master to them.
>>>
>>> If so, I want to suggest a easier way to solve it.
>>> We don't have to add a new flag to dlm structure, just leverage existed
>>> dlm status and bitmap.
>>> It will bring a bonus - no lock resource will be marked with RECOVERING,
>>> it's a safer way, I suppose.
>>>
>>> Please take a review.
>>>
>>> Thanks,
>>> Changwei
>>>
>>>
>>> Subject: [PATCH] ocfs2/dlm: a node can't be involved in recovery if it
>>> is being shutdown
>>>
>>> Signed-off-by: Changwei Ge <ge.chang...@h3c.com>
>>> ---
>>>fs/ocfs2/dlm/dlmdomain.c   | 4 
>>>fs/ocfs2/dlm/dlmrecovery.c | 3 +++
>>>2 files changed, 7 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>>> index a2b19fbdcf46..5e9283e509a4 100644
>>> --- a/fs/ocfs2/dlm/dlmdomain.c
>>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>>> @@ -707,11 +707,15 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
>>>  * want new domain joins to communicate with us at
>>>  * least until we've completed migration of our
>>>  * resources. */
>>> +   spin_lock(>spinlock);
>>> dlm->dlm_state = DLM_CTXT_IN_SHUTDOWN;
>>> +   spin_unlock(>spinlock);
>> I guess there will be misuse of 'dlm->spinlock' and dlm_domain_lock.
>>> leave = 1;
>>> }
>>> spin_unlock(_domain_lock);
>>>
>>> +   dlm_wait_for_recovery(dlm);
>>> +
>>> if (leave) {
>>> mlog(0, "shutting down domain %s\n", dlm->name);
>>> dlm_begin_exit_domain(dlm);
>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>> index 74407c6dd592..764c95b2b35c 100644
>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>> @@ -2441,6 +2441,9 @@ static void __dlm_hb_node_down(struct dlm_ctxt
>>> *dlm, int idx)
>>>{
>>> assert_spin_locked(>spinlock);
>>>
>>> +   if (dlm->dlm_state == DLM_CTXT_IN_SHUTDOWN)
>>> +   return;
>>> +
>> 'dlm->dlm_state' probably need be to protected by 'dlm_domain_lock'.
>> and I wander if there is more work to be done when in
>> 'DLM_CTXT_IN_SHUTDOWN'?
>>> if (dlm->reco.new_master == idx) {
>>> mlog(0, "%s: recovery master %d just died\n",
>>>  dlm->name, idx);
>>>
>>
> 
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lockres

2017-11-01 Thread piaojun
Hi Changwei,

thanks for reviewing. I think waiting for recovery to be done before
migrating seems like another possible solution, but I wonder whether new
problems would be introduced, as noted in the comments below.

thanks,
Jun

On 2017/11/1 15:13, Changwei Ge wrote:
> Hi Jun,
> 
> I probably get your point.
> 
> You mean that dlm finds no lock resource to be migrated and no more lock 
> resource is managed by its hash table. After that a node dies all of a 
> sudden and the dead node is put into dlm's recovery map, right? 
that is it.
> Furthermore, a lock resource is migrated from other nodes and local node 
> has already asserted master to them.
> 
> If so, I want to suggest a easier way to solve it.
> We don't have to add a new flag to dlm structure, just leverage existed 
> dlm status and bitmap.
> It will bring a bonus - no lock resource will be marked with RECOVERING, 
> it's a safer way, I suppose.
> 
> Please take a review.
> 
> Thanks,
> Changwei
> 
> 
> Subject: [PATCH] ocfs2/dlm: a node can't be involved in recovery if it 
> is being shutdown
> 
> Signed-off-by: Changwei Ge 
> ---
>   fs/ocfs2/dlm/dlmdomain.c   | 4 
>   fs/ocfs2/dlm/dlmrecovery.c | 3 +++
>   2 files changed, 7 insertions(+)
> 
> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
> index a2b19fbdcf46..5e9283e509a4 100644
> --- a/fs/ocfs2/dlm/dlmdomain.c
> +++ b/fs/ocfs2/dlm/dlmdomain.c
> @@ -707,11 +707,15 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
>* want new domain joins to communicate with us at
>* least until we've completed migration of our
>* resources. */
> + spin_lock(>spinlock);
>   dlm->dlm_state = DLM_CTXT_IN_SHUTDOWN;
> + spin_unlock(>spinlock);
I guess there will be misuse of 'dlm->spinlock' and dlm_domain_lock.
>   leave = 1;
>   }
>   spin_unlock(_domain_lock);
> 
> + dlm_wait_for_recovery(dlm);
> +
>   if (leave) {
>   mlog(0, "shutting down domain %s\n", dlm->name);
>   dlm_begin_exit_domain(dlm);
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6dd592..764c95b2b35c 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2441,6 +2441,9 @@ static void __dlm_hb_node_down(struct dlm_ctxt 
> *dlm, int idx)
>   {
>   assert_spin_locked(>spinlock);
> 
> + if (dlm->dlm_state == DLM_CTXT_IN_SHUTDOWN)
> + return;
> +
'dlm->dlm_state' probably needs to be protected by 'dlm_domain_lock',
and I wonder if there is more work to be done when in
'DLM_CTXT_IN_SHUTDOWN'?
>   if (dlm->reco.new_master == idx) {
>   mlog(0, "%s: recovery master %d just died\n",
>dlm->name, idx);
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lockres

2017-10-31 Thread piaojun
Hi Changwei,

On 2017/11/1 10:47, Changwei Ge wrote:
> Hi Jun,
> 
> Thanks for reporting.
> I am very interesting in this issue. But, first of all, I want to make 
> this issue clear, so that I might be able to provide some comments.
> 
> 
> On 2017/11/1 9:16, piaojun wrote:
>> wait for dlm recovery done when migrating all lockres in case of new
>> lockres to be left after leaving dlm domain.
> 
> What do you mean by 'a new lock resource to be left after leaving 
> domain'? It means we leak a dlm lock resource if below situation happens.
> 
The new lockres is the one collected by NodeA during recovery for NodeB.
It leaks a lockres indeed.
>>
>>NodeA   NodeBNodeC
>>
>> umount and migrate
>> all lockres
>>
>>   node down
>>
>> do recovery for NodeB
>> and collect a new lockres
>> form other live nodes
> 
> You mean a lock resource whose owner was NodeB is just migrated from 
> other cluster member nodes?
> 
that is it.
>>
>> leave domain but the
>> new lockres remains
>>
>>mount and join domain
>>
>>request for the owner
>>of the new lockres, but
>>all the other nodes said
>>'NO', so NodeC decide to
>>the owner, and send do
>>assert msg to other nodes.
>>
>>other nodes receive the 
>> msg
>>and found two masters 
>> exist.
>>at last cause BUG in
>>
>> dlm_assert_master_handler()
>>-->BUG();
> 
> If this issue truly exists, can we take some efforts in 
> dlm_exit_domain_handler? Or perhaps we should kick dlm's work queue 
> before migrating all lock resources.
> 
If NodeA has already entered dlm_leave_domain(), we can hardly go back to
migrating the res. Perhaps more work would be needed to handle it that way.
>>
>> Fixes: bc9838c4d44a ("dlm: allow dlm do recovery during shutdown")
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>> ---
>>   fs/ocfs2/dlm/dlmcommon.h   |  1 +
>>   fs/ocfs2/dlm/dlmdomain.c   | 14 ++++++++++++++
>>   fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++++++---
>>   3 files changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
>> index e9f3705..999ab7d 100644
>> --- a/fs/ocfs2/dlm/dlmcommon.h
>> +++ b/fs/ocfs2/dlm/dlmcommon.h
>> @@ -140,6 +140,7 @@ struct dlm_ctxt
>>  u8 node_num;
>>  u32 key;
>>  u8  joining_node;
>> +u8 migrate_done; /* set to 1 means node has migrated all lockres */
>>  wait_queue_head_t dlm_join_events;
>>  unsigned long live_nodes_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
>>  unsigned long domain_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>> index e1fea14..98a8f56 100644
>> --- a/fs/ocfs2/dlm/dlmdomain.c
>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>> @@ -461,6 +461,18 @@ static int dlm_migrate_all_locks(struct dlm_ctxt *dlm)
>>  cond_resched_lock(&dlm->spinlock);
>>  num += n;
>>  }
>> +
>> +if (!num) {
>> +if (dlm->reco.state & DLM_RECO_STATE_ACTIVE) {
>> +mlog(0, "%s: perhaps there are more lock resources need to "
>> +"be migrated after dlm recovery\n", dlm->name);
> 
> If dlm is marked with DLM_RECO_STATE_ACTIVE, then a lock resource must 
> already be marked with DLM_LOCK_RES_RECOVERING, which can't be migrated, 
> so the code will goto redo_bucket in dlm_migrate_all_locks().
> So I don't understand why this check is added here?
> 
> 
> 
Because the migration pass may finish before recovery does, the check here
is meant to catch that and force another pass, so that lockres collected by
a later recovery are not missed.
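For reference, and from memory of the mainline code (please double-check),
the caller in dlm_unregister_domain() keeps retrying as long as
dlm_migrate_all_locks() returns non-zero, so returning -EAGAIN here simply
forces another migration pass:

	while (dlm_migrate_all_locks(dlm)) {
		/* Give dlm_thread time to purge the lockres' */
		msleep(500);
		mlog(0, "%s: more migration to do\n", dlm->name);
	}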
>> +ret = -EAGAIN;
>> +} else {
>> +mlog(0, "%s: we won't d

Re: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies

2017-10-18 Thread piaojun


On 2017/10/17 14:48, Changwei Ge wrote:
> When a node dies, the other live nodes have to choose a new master
> for an existing lock resource mastered by the dead node.
> 
> As for the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list(), which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM changes the lock resource's master later.
> 
> So without invoking dlm_move_lockres_to_recovery_list(), no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through the ::resource list.
> 
> What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for
> lock resources mastered by a dead node, it will break synchronization
> among nodes.
> 
> So invoke dlm_move_lockres_to_recovery_list() again.
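For readers unfamiliar with the recovery path, a very rough sketch of the
two effects the description above relies on. The field names are from my
reading of dlmcommon.h, and the real function also handles references and
in-flight migration, so treat this purely as an illustration:

	/* sketch: what "moving a lockres to the recovery list" boils down to */
	static void move_lockres_to_recovery_list_sketch(struct dlm_ctxt *dlm,
							 struct dlm_lock_resource *res)
	{
		/* block normal use until recovery picks a new master */
		res->state |= DLM_LOCK_RES_RECOVERING;

		/* make the lockres visible to the recovery code later on */
		list_add_tail(&res->recovering, &dlm->reco.resources);
	}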
> 
> Fixes: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
> 
> Reported-by: Vitaly Mayatskih 
> Signed-off-by: Changwei Ge 
Reviewed-by: Jun Piao 

> ---
>   fs/ocfs2/dlm/dlmrecovery.c |1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
>   dlm_lockres_put(res);
>   continue;
>   }
> + dlm_move_lockres_to_recovery_list(dlm, res);
>   } else if (res->owner == dlm->node_num) {
>   dlm_free_dead_locks(dlm, res, dead_node);
>   __dlm_lockres_calc_usage(dlm, res);
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies

2017-10-18 Thread piaojun
Hi Changwei,

Could you share the method to reproduce the problem?

On 2017/10/17 14:48, Changwei Ge wrote:
> When a node dies, the other live nodes have to choose a new master
> for an existing lock resource mastered by the dead node.
> 
> As for the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list(), which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM changes the lock resource's master later.
> 
> So without invoking dlm_move_lockres_to_recovery_list(), no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through the ::resource list.
> 
> What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for
> lock resources mastered by a dead node, it will break synchronization
> among nodes.
> 
> So invoke dlm_move_lockres_to_recovery_list() again.
> 
> Fixes: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
> 
> Reported-by: Vitaly Mayatskih 
> Signed-off-by: Changwei Ge 
> ---
>   fs/ocfs2/dlm/dlmrecovery.c |1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
>   dlm_lockres_put(res);
>   continue;
>   }
> + dlm_move_lockres_to_recovery_list(dlm, res);
>   } else if (res->owner == dlm->node_num) {
>   dlm_free_dead_locks(dlm, res, dead_node);
>   __dlm_lockres_calc_usage(dlm, res);
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH 2/2] ocfs2: cleanup unused func declaration and assignment

2017-10-13 Thread piaojun
Signed-off-by: Jun Piao 
---
 fs/ocfs2/alloc.c | 2 --
 fs/ocfs2/cluster/heartbeat.h | 2 --
 2 files changed, 4 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index a177eae..31a416d 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -3585,8 +3585,6 @@ static int ocfs2_merge_rec_left(struct ocfs2_path *right_path,
 * The easy case - we can just plop the record right in.
 */
*left_rec = *split_rec;
-
-   has_empty_extent = 0;
} else
le16_add_cpu(&left_rec->e_leaf_clusters, split_clusters);

diff --git a/fs/ocfs2/cluster/heartbeat.h b/fs/ocfs2/cluster/heartbeat.h
index 3ef5137..a9e67ef 100644
--- a/fs/ocfs2/cluster/heartbeat.h
+++ b/fs/ocfs2/cluster/heartbeat.h
@@ -79,10 +79,8 @@ void o2hb_fill_node_map(unsigned long *map,
unsigned bytes);
 void o2hb_exit(void);
 int o2hb_init(void);
-int o2hb_check_node_heartbeating(u8 node_num);
 int o2hb_check_node_heartbeating_no_sem(u8 node_num);
 int o2hb_check_node_heartbeating_from_callback(u8 node_num);
-int o2hb_check_local_node_heartbeating(void);
 void o2hb_stop_all_regions(void);
 int o2hb_get_all_regions(char *region_uuids, u8 numregions);
 int o2hb_global_heartbeat_active(void);
-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH 1/2] ocfs2: no need flush workqueue before destroying it

2017-10-13 Thread piaojun
destroy_workqueue() already drains and flushes all pending work before tearing the workqueue down, so calling flush_workqueue() right before it is redundant.
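A minimal illustration of the point (generic code, not taken from ocfs2):
in current kernels destroy_workqueue() calls drain_workqueue() internally,
so the explicit flush adds nothing:

	static void teardown_wq(struct workqueue_struct *wq)
	{
		if (wq) {
			/* no flush_workqueue(wq) needed here;
			 * destroy_workqueue() drains the queue first */
			destroy_workqueue(wq);
		}
	}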

Signed-off-by: Jun Piao 
---
 fs/ocfs2/dlm/dlmdomain.c | 1 -
 fs/ocfs2/dlmfs/dlmfs.c   | 1 -
 fs/ocfs2/super.c | 4 +---
 3 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index a2b19fb..e1fea14 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -394,7 +394,6 @@ int dlm_domain_fully_joined(struct dlm_ctxt *dlm)
 static void dlm_destroy_dlm_worker(struct dlm_ctxt *dlm)
 {
if (dlm->dlm_worker) {
-   flush_workqueue(dlm->dlm_worker);
destroy_workqueue(dlm->dlm_worker);
dlm->dlm_worker = NULL;
}
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 9ab9e18..edce7b5 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -670,7 +670,6 @@ static void __exit exit_dlmfs_fs(void)
 {
unregister_filesystem(&dlmfs_fs_type);

-   flush_workqueue(user_dlm_worker);
destroy_workqueue(user_dlm_worker);

/*
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 8073349..040bbb6 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2521,10 +2521,8 @@ static void ocfs2_delete_osb(struct ocfs2_super *osb)
/* This function assumes that the caller has the main osb resource */

/* ocfs2_initializer_super have already created this workqueue */
-   if (osb->ocfs2_wq) {
-   flush_workqueue(osb->ocfs2_wq);
+   if (osb->ocfs2_wq)
destroy_workqueue(osb->ocfs2_wq);
-   }

ocfs2_free_slot_info(osb);

-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: no need flush workqueue before destroying it

2017-10-12 Thread piaojun
1. delete redundant flush_workqueue();
2. delete some unused func declaration and assignment.

Signed-off-by: Jun Piao 
---
 fs/ocfs2/alloc.c | 2 --
 fs/ocfs2/cluster/heartbeat.h | 2 --
 fs/ocfs2/dlm/dlmdomain.c | 1 -
 fs/ocfs2/dlmfs/dlmfs.c   | 1 -
 fs/ocfs2/super.c | 4 +---
 5 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index a177eae..31a416d 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -3585,8 +3585,6 @@ static int ocfs2_merge_rec_left(struct ocfs2_path *right_path,
 * The easy case - we can just plop the record right in.
 */
*left_rec = *split_rec;
-
-   has_empty_extent = 0;
} else
le16_add_cpu(&left_rec->e_leaf_clusters, split_clusters);

diff --git a/fs/ocfs2/cluster/heartbeat.h b/fs/ocfs2/cluster/heartbeat.h
index 3ef5137..a9e67ef 100644
--- a/fs/ocfs2/cluster/heartbeat.h
+++ b/fs/ocfs2/cluster/heartbeat.h
@@ -79,10 +79,8 @@ void o2hb_fill_node_map(unsigned long *map,
unsigned bytes);
 void o2hb_exit(void);
 int o2hb_init(void);
-int o2hb_check_node_heartbeating(u8 node_num);
 int o2hb_check_node_heartbeating_no_sem(u8 node_num);
 int o2hb_check_node_heartbeating_from_callback(u8 node_num);
-int o2hb_check_local_node_heartbeating(void);
 void o2hb_stop_all_regions(void);
 int o2hb_get_all_regions(char *region_uuids, u8 numregions);
 int o2hb_global_heartbeat_active(void);
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index a2b19fb..e1fea14 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -394,7 +394,6 @@ int dlm_domain_fully_joined(struct dlm_ctxt *dlm)
 static void dlm_destroy_dlm_worker(struct dlm_ctxt *dlm)
 {
if (dlm->dlm_worker) {
-   flush_workqueue(dlm->dlm_worker);
destroy_workqueue(dlm->dlm_worker);
dlm->dlm_worker = NULL;
}
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 9ab9e18..edce7b5 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -670,7 +670,6 @@ static void __exit exit_dlmfs_fs(void)
 {
unregister_filesystem(&dlmfs_fs_type);

-   flush_workqueue(user_dlm_worker);
destroy_workqueue(user_dlm_worker);

/*
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 8073349..040bbb6 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2521,10 +2521,8 @@ static void ocfs2_delete_osb(struct ocfs2_super *osb)
/* This function assumes that the caller has the main osb resource */

/* ocfs2_initializer_super have already created this workqueue */
-   if (osb->ocfs2_wq) {
-   flush_workqueue(osb->ocfs2_wq);
+   if (osb->ocfs2_wq)
destroy_workqueue(osb->ocfs2_wq);
-   }

ocfs2_free_slot_info(osb);

-- 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [patch 1/2] ocfs2/dlm: protect 'tracking_list' by 'track_lock'

2017-09-26 Thread piaojun


On 2017/9/27 9:32, Joseph Qi wrote:
> 
> 
> On 17/9/26 08:39, piaojun wrote:
>>
>>
>> On 2017/9/25 18:35, Joseph Qi wrote:
>>>
>>>
>>> On 17/9/23 11:39, piaojun wrote:
>>>> 'dlm->tracking_list' needs to be protected by 'dlm->track_lock'.
>>>>
>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>>>> ---
>>>>  fs/ocfs2/dlm/dlmdomain.c | 7 ++-
>>>>  fs/ocfs2/dlm/dlmmaster.c | 4 ++--
>>>>  2 files changed, 8 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>>>> index a2b19fb..b118525 100644
>>>> --- a/fs/ocfs2/dlm/dlmdomain.c
>>>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>>>> @@ -726,12 +726,17 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
>>>>}
>>>>
>>>>/* This list should be empty. If not, print remaining lockres */
>>>> +  spin_lock(&dlm->track_lock);
>>>>if (!list_empty(&dlm->tracking_list)) {
>>>>mlog(ML_ERROR, "Following lockres' are still on the "
>>>> "tracking list:\n");
>>>> -  list_for_each_entry(res, &dlm->tracking_list, tracking)
>>>> +  list_for_each_entry(res, &dlm->tracking_list, tracking) {
>>>> +  spin_unlock(&dlm->track_lock);
>>>
>>> Um... If we unlock here, the iterator still has a chance to be corrupted.
>>>
>>> Thanks,
>>> Joseph
>>>
>>
>> We don't need to care much about the corrupted 'tracking_list', because we
>> have already picked up 'res' from 'tracking_list'. Then we take
>> 'track_lock' again to prevent 'tracking_list' from being corrupted. But
>> I'd better make sure that 'res' is not NULL before printing, just like:
>>
>> list_for_each_entry(res, &dlm->tracking_list, tracking) {
>> 	spin_unlock(&dlm->track_lock);
>> 	if (res)
>> 		dlm_print_one_lock_resource(res);
>> 	spin_lock(&dlm->track_lock);
>> }
>>
>> Thanks
>> Jun
> 
> IIUC, your intent in adding track_lock here is to protect the tracking list
> while iterating the loop, right? I am saying that if we unlock track_lock
> here, the loop is still unsafe.
> Checking res here is meaningless. Maybe list_for_each_entry_safe
> could work here.
> BTW, how does this race case happen? The above code runs during umount;
> what is the other flow?
> 
> Thanks,
> Joseph
> .
> 

I have not caught the race case yet, and the code rarely enters this
branch, because 'tracking_list' is always empty here. The key problem
is that we try to protect 'tracking_list' under both 'res->spinlock' and
'dlm->track_lock', but we cannot take 'res->spinlock' before
iterating 'tracking_list'. I have to figure it out further.
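A rough sketch of the list_for_each_entry_safe() variant Joseph mentioned
(it only protects against the entry being printed getting removed while the
lock is dropped, not against other changes to the list, so it is a starting
point rather than a complete fix):

	struct dlm_lock_resource *res, *tmp;

	spin_lock(&dlm->track_lock);
	if (!list_empty(&dlm->tracking_list)) {
		mlog(ML_ERROR, "Following lockres' are still on the "
		     "tracking list:\n");
		list_for_each_entry_safe(res, tmp, &dlm->tracking_list, tracking) {
			spin_unlock(&dlm->track_lock);
			dlm_print_one_lock_resource(res);
			spin_lock(&dlm->track_lock);
		}
	}
	spin_unlock(&dlm->track_lock);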

thanks
Jun

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [patch 1/2] ocfs2/dlm: protect 'tracking_list' by 'track_lock'

2017-09-25 Thread piaojun


On 2017/9/25 18:35, Joseph Qi wrote:
> 
> 
> On 17/9/23 11:39, piaojun wrote:
>> 'dlm->tracking_list' needs to be protected by 'dlm->track_lock'.
>>
>> Signed-off-by: Jun Piao <piao...@huawei.com>
>> Reviewed-by: Alex Chen <alex.c...@huawei.com>
>> ---
>>  fs/ocfs2/dlm/dlmdomain.c | 7 ++-
>>  fs/ocfs2/dlm/dlmmaster.c | 4 ++--
>>  2 files changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>> index a2b19fb..b118525 100644
>> --- a/fs/ocfs2/dlm/dlmdomain.c
>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>> @@ -726,12 +726,17 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
>>  }
>>
>>  /* This list should be empty. If not, print remaining lockres */
>> +spin_lock(&dlm->track_lock);
>>  if (!list_empty(&dlm->tracking_list)) {
>>  mlog(ML_ERROR, "Following lockres' are still on the "
>>   "tracking list:\n");
>> -list_for_each_entry(res, &dlm->tracking_list, tracking)
>> +list_for_each_entry(res, &dlm->tracking_list, tracking) {
>> +spin_unlock(&dlm->track_lock);
> 
> Um... If we unlock here, the iterator still has a chance to be corrupted.
> 
> Thanks,
> Joseph
> 

We don't need to care much about the corrupted 'tracking_list', because we
have already picked up 'res' from 'tracking_list'. Then we take
'track_lock' again to prevent 'tracking_list' from being corrupted. But
I'd better make sure that 'res' is not NULL before printing, just like:

list_for_each_entry(res, &dlm->tracking_list, tracking) {
	spin_unlock(&dlm->track_lock);
	if (res)
		dlm_print_one_lock_resource(res);
	spin_lock(&dlm->track_lock);
}

Thanks
Jun

>>  dlm_print_one_lock_resource(res);
>> +spin_lock(&dlm->track_lock);
>> +}
>>  }
>> +spin_unlock(&dlm->track_lock);
>>
>>  dlm_mark_domain_leaving(dlm);
>>  dlm_leave_domain(dlm);
>> diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
>> index 3e04279..44e7d18 100644
>> --- a/fs/ocfs2/dlm/dlmmaster.c
>> +++ b/fs/ocfs2/dlm/dlmmaster.c
>> @@ -589,9 +589,9 @@ static void dlm_init_lockres(struct dlm_ctxt *dlm,
>>
>>  res->last_used = 0;
>>
>> -spin_lock(&dlm->spinlock);
>> +spin_lock(&dlm->track_lock);
>>  list_add_tail(&res->tracking, &dlm->tracking_list);
>> -spin_unlock(&dlm->spinlock);
>> +spin_unlock(&dlm->track_lock);
>>
>>  memset(res->lvb, 0, DLM_LVB_LEN);
>>  memset(res->refmap, 0, sizeof(res->refmap));
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

