Re: [Gluster-users] question about info and info.tmp

2016-11-24 Thread Atin Mukherjee
On Fri, Nov 25, 2016 at 1:14 PM, songxin  wrote:

> Hi Atin,
> It seems that this workaround should be done by manual.
> Is that right?
> And even the files in bricks/* may be empty too.
>

Yes, that's right


>
> Do you have a workaround, which is implemented in glusterfs code?
>

Workaround is by nature manual and anything to be done through code should
be considered as fix not work around :)


>
> Thanks,
> Xin
>
>
>
>
>
> 在 2016-11-25 15:36:29,"Atin Mukherjee"  写道:
>
>
>
> On Fri, Nov 25, 2016 at 12:06 PM, songxin  wrote:
>
>> Hi Atin,
>> Do you mean that you have the workaround applicable now?
>> Or it will take time to design the workaround?
>>
>> If you have workaround now, could you share it to me ?
>>
>
> If you end up in having a 0 byte info file you'd need to copy the same
> info file from other node and put it there and restart glusterd.
>
>
>>
>> Thanks,
>> Xin,
>>
>>
>>
>>
>>
>> 在 2016-11-24 19:12:07,"Atin Mukherjee"  写道:
>>
>> Xin - I appreciate your patience. I'd need some more time to pick this
>> item up from my backlog. I believe we have a workaround applicable here too.
>>
>> On Thu, 24 Nov 2016 at 14:24, songxin  wrote:
>>
>>>
>>>
>>>
>>> Hi Atin,
>>> Actually, the glusterfs is used in my project.
>>> And our test team find this issue.
>>> So I want to make sure that whether you plan to fix it.
>>> if you have plan I will wait you because your method shoud be better
>>> than mine.
>>>
>>> Thanks,
>>> Xin
>>>
>>>
>>> 在 2016-11-21 10:00:36,"Atin Mukherjee"  写道:
>>>
>>> Hi Xin,
>>>
>>> I've not got a chance to look into it yet. delete stale volume function
>>> is in place to take care of wiping off volume configuration data which has
>>> been deleted from the cluster. However we need to revisit this code to see
>>> if this function is anymore needed given we recently added a validation to
>>> fail delete request if one of the glusterd is down. I'll get back to you on
>>> this.
>>>
>>> On Mon, 21 Nov 2016 at 07:24, songxin  wrote:
>>>
>>> Hi Atin,
>>> Thank you for your support.
>>>
>>> And any conclusions about this issue?
>>>
>>> Thanks,
>>> Xin
>>>
>>>
>>>
>>>
>>>
>>> 在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:
>>>
>>>
>>>
>>> On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:
>>>
>>> ok, thank you.
>>>
>>>
>>>
>>>
>>> 在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:
>>>
>>>
>>>
>>> On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:
>>>
>>>
>>> Hi Atin,
>>>
>>> I think the root cause is in the function glusterd_import_friend_volume
>>> as below.
>>>
>>> int32_t
>>> glusterd_import_friend_volume (dict_t *peer_data, size_t count)
>>> {
>>> ...
>>> ret = glusterd_volinfo_find (new_volinfo->volname,
>>> _volinfo);
>>> if (0 == ret) {
>>> (void) gd_check_and_update_rebalance_info (old_volinfo,
>>>n
>>> ew_volinfo);
>>> (void) glusterd_delete_stale_volume (old_volinfo,
>>> new_volinfo);
>>> }
>>> ...
>>> ret = glusterd_store_volinfo (new_volinfo,
>>> GLUSTERD_VOLINFO_VER_AC_NONE);
>>> if (ret) {
>>> gf_msg (this->name, GF_LOG_ERROR, 0,
>>> GD_MSG_VOLINFO_STORE_FAIL, "Failed to store "
>>> "volinfo for volume %s", new_volinfo->volname);
>>> goto out;
>>> }
>>> ...
>>> }
>>>
>>> glusterd_delete_stale_volume will remove the info and bricks/* and the
>>> glusterd_store_volinfo will create the new one.
>>> But if glusterd is killed before rename the info will is empty.
>>>
>>> And glusterd will start failed because the infois empty in the next time
>>> you start the glusterd.
>>>
>>> Any idea, Atin?
>>>
>>>
>>> Give me some time, will check it out, but reading at this analysis looks
>>> very well possible if a volume is changed when the glusterd was done on
>>> node a and when the same comes up during peer handshake we update the
>>> volinfo and during that time glusterd goes down once again. I'll confirm it
>>> by tomorrow.
>>>
>>>
>>> I checked the code and it does look like you have got the right RCA for
>>> the issue which you simulated through those two scripts. However this can
>>> happen even when you try to create a fresh volume and while glusterd tries
>>> to write the content into the store and goes down before renaming the
>>> info.tmp file you get into the same situation.
>>>
>>> I'd really need to think through if this can be fixed. Suggestions are
>>> always appreciated.
>>>
>>>
>>>
>>>
>>> BTW, excellent work Xin!
>>>
>>>
>>> Thanks,
>>> Xin
>>>
>>>
>>> 在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:
>>>
>>>
>>>
>>> On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:
>>>
>>> Hi Atin,
>>> I have some 

Re: [Gluster-users] question about info and info.tmp

2016-11-24 Thread songxin
Hi Atin,
It seems that this workaround should be done by manual.
Is that right?
And even the files in bricks/* may be empty too.


Do you have a workaround, which is implemented in glusterfs code?


Thanks,
Xin






在 2016-11-25 15:36:29,"Atin Mukherjee"  写道:





On Fri, Nov 25, 2016 at 12:06 PM, songxin  wrote:

Hi Atin,
Do you mean that you have the workaround applicable now?
Or it will take time to design the workaround?


If you have workaround now, could you share it to me ?


If you end up in having a 0 byte info file you'd need to copy the same info 
file from other node and put it there and restart glusterd.
 



Thanks,
Xin,






在 2016-11-24 19:12:07,"Atin Mukherjee"  写道:

Xin - I appreciate your patience. I'd need some more time to pick this item up 
from my backlog. I believe we have a workaround applicable here too.


On Thu, 24 Nov 2016 at 14:24, songxin  wrote:





Hi Atin,
Actually, the glusterfs is used in my project.
And our test team find this issue.
So I want to make sure that whether you plan to fix it.
if you have plan I will wait you because your method shoud be better than mine.


Thanks,
Xin



在 2016-11-21 10:00:36,"Atin Mukherjee"  写道:

Hi Xin,

I've not got a chance to look into it yet. delete stale volume function is in 
place to take care of wiping off volume configuration data which has been 
deleted from the cluster. However we need to revisit this code to see if this 
function is anymore needed given we recently added a validation to fail delete 
request if one of the glusterd is down. I'll get back to you on this.


On Mon, 21 Nov 2016 at 07:24, songxin  wrote:

Hi Atin,
Thank you for your support.


And any conclusions about this issue?


Thanks,
Xin






在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:

ok, thank you.





在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:



Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Give me some time, will check it out, but reading at this analysis looks very 
well possible if a volume is changed when the glusterd was done on node a and 
when the same comes up during peer handshake we update the volinfo and during 
that time glusterd goes down once again. I'll confirm it by tomorrow.



I checked the code and it does look like you have got the right RCA for the 
issue which you simulated through those two scripts. However this can happen 
even when you try to create a fresh volume and while glusterd tries to write 
the content into the store and goes down before renaming the info.tmp file you 
get into the same situation.


I'd really need to think through if this can be fixed. Suggestions are always 
appreciated.

 



BTW, excellent work Xin!




Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo 

Re: [Gluster-users] question about info and info.tmp

2016-11-24 Thread Atin Mukherjee
On Fri, Nov 25, 2016 at 12:06 PM, songxin  wrote:

> Hi Atin,
> Do you mean that you have the workaround applicable now?
> Or it will take time to design the workaround?
>
> If you have workaround now, could you share it to me ?
>

If you end up in having a 0 byte info file you'd need to copy the same info
file from other node and put it there and restart glusterd.


>
> Thanks,
> Xin,
>
>
>
>
>
> 在 2016-11-24 19:12:07,"Atin Mukherjee"  写道:
>
> Xin - I appreciate your patience. I'd need some more time to pick this
> item up from my backlog. I believe we have a workaround applicable here too.
>
> On Thu, 24 Nov 2016 at 14:24, songxin  wrote:
>
>>
>>
>>
>> Hi Atin,
>> Actually, the glusterfs is used in my project.
>> And our test team find this issue.
>> So I want to make sure that whether you plan to fix it.
>> if you have plan I will wait you because your method shoud be better than
>> mine.
>>
>> Thanks,
>> Xin
>>
>>
>> 在 2016-11-21 10:00:36,"Atin Mukherjee"  写道:
>>
>> Hi Xin,
>>
>> I've not got a chance to look into it yet. delete stale volume function
>> is in place to take care of wiping off volume configuration data which has
>> been deleted from the cluster. However we need to revisit this code to see
>> if this function is anymore needed given we recently added a validation to
>> fail delete request if one of the glusterd is down. I'll get back to you on
>> this.
>>
>> On Mon, 21 Nov 2016 at 07:24, songxin  wrote:
>>
>> Hi Atin,
>> Thank you for your support.
>>
>> And any conclusions about this issue?
>>
>> Thanks,
>> Xin
>>
>>
>>
>>
>>
>> 在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:
>>
>> ok, thank you.
>>
>>
>>
>>
>> 在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:
>>
>>
>> Hi Atin,
>>
>> I think the root cause is in the function glusterd_import_friend_volume
>> as below.
>>
>> int32_t
>> glusterd_import_friend_volume (dict_t *peer_data, size_t count)
>> {
>> ...
>> ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo);
>> if (0 == ret) {
>> (void) gd_check_and_update_rebalance_info (old_volinfo,
>>new_volinfo);
>> (void) glusterd_delete_stale_volume (old_volinfo,
>> new_volinfo);
>> }
>> ...
>> ret = glusterd_store_volinfo (new_volinfo,
>> GLUSTERD_VOLINFO_VER_AC_NONE);
>> if (ret) {
>> gf_msg (this->name, GF_LOG_ERROR, 0,
>> GD_MSG_VOLINFO_STORE_FAIL, "Failed to store "
>> "volinfo for volume %s", new_volinfo->volname);
>> goto out;
>> }
>> ...
>> }
>>
>> glusterd_delete_stale_volume will remove the info and bricks/* and the
>> glusterd_store_volinfo will create the new one.
>> But if glusterd is killed before rename the info will is empty.
>>
>> And glusterd will start failed because the infois empty in the next time
>> you start the glusterd.
>>
>> Any idea, Atin?
>>
>>
>> Give me some time, will check it out, but reading at this analysis looks
>> very well possible if a volume is changed when the glusterd was done on
>> node a and when the same comes up during peer handshake we update the
>> volinfo and during that time glusterd goes down once again. I'll confirm it
>> by tomorrow.
>>
>>
>> I checked the code and it does look like you have got the right RCA for
>> the issue which you simulated through those two scripts. However this can
>> happen even when you try to create a fresh volume and while glusterd tries
>> to write the content into the store and goes down before renaming the
>> info.tmp file you get into the same situation.
>>
>> I'd really need to think through if this can be fixed. Suggestions are
>> always appreciated.
>>
>>
>>
>>
>> BTW, excellent work Xin!
>>
>>
>> Thanks,
>> Xin
>>
>>
>> 在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:
>>
>> Hi Atin,
>> I have some clues about this issue.
>> I could reproduce this issue use the scrip that mentioned in
>> https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .
>>
>>
>> I really appreciate your help in trying to nail down this issue. While I
>> am at your email and going through the code to figure out the possible
>> cause for it, unfortunately I don't see any script in the attachment of the
>> bug.  Could you please cross check?
>>
>>
>>
>> After I added some debug print,which like below, in glusterd-store.c and
>> I found that the /var/lib/glusterd/vols/xxx/info and
>> /var/lib/glusterd/vols/xxx/bricks/* are removed.
>> But other files in /var/lib/glusterd/vols/xxx/ will not be remove.
>>
>> 

Re: [Gluster-users] question about info and info.tmp

2016-11-24 Thread songxin
Hi Atin,
Do you mean that you have the workaround applicable now?
Or it will take time to design the workaround?


If you have workaround now, could you share it to me ?


Thanks,
Xin,






在 2016-11-24 19:12:07,"Atin Mukherjee"  写道:

Xin - I appreciate your patience. I'd need some more time to pick this item up 
from my backlog. I believe we have a workaround applicable here too.


On Thu, 24 Nov 2016 at 14:24, songxin  wrote:





Hi Atin,
Actually, the glusterfs is used in my project.
And our test team find this issue.
So I want to make sure that whether you plan to fix it.
if you have plan I will wait you because your method shoud be better than mine.


Thanks,
Xin



在 2016-11-21 10:00:36,"Atin Mukherjee"  写道:

Hi Xin,

I've not got a chance to look into it yet. delete stale volume function is in 
place to take care of wiping off volume configuration data which has been 
deleted from the cluster. However we need to revisit this code to see if this 
function is anymore needed given we recently added a validation to fail delete 
request if one of the glusterd is down. I'll get back to you on this.


On Mon, 21 Nov 2016 at 07:24, songxin  wrote:

Hi Atin,
Thank you for your support.


And any conclusions about this issue?


Thanks,
Xin






在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:

ok, thank you.





在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:



Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Give me some time, will check it out, but reading at this analysis looks very 
well possible if a volume is changed when the glusterd was done on node a and 
when the same comes up during peer handshake we update the volinfo and during 
that time glusterd goes down once again. I'll confirm it by tomorrow.



I checked the code and it does look like you have got the right RCA for the 
issue which you simulated through those two scripts. However this can happen 
even when you try to create a fresh volume and while glusterd tries to write 
the content into the store and goes down before renaming the info.tmp file you 
get into the same situation.


I'd really need to think through if this can be fixed. Suggestions are always 
appreciated.

 



BTW, excellent work Xin!




Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, 

Re: [Gluster-users] question about info and info.tmp

2016-11-24 Thread Atin Mukherjee
Xin - I appreciate your patience. I'd need some more time to pick this item
up from my backlog. I believe we have a workaround applicable here too.

On Thu, 24 Nov 2016 at 14:24, songxin  wrote:

>
>
>
> Hi Atin,
> Actually, the glusterfs is used in my project.
> And our test team find this issue.
> So I want to make sure that whether you plan to fix it.
> if you have plan I will wait you because your method shoud be better than
> mine.
>
> Thanks,
> Xin
>
>
> 在 2016-11-21 10:00:36,"Atin Mukherjee"  写道:
>
> Hi Xin,
>
> I've not got a chance to look into it yet. delete stale volume function is
> in place to take care of wiping off volume configuration data which has
> been deleted from the cluster. However we need to revisit this code to see
> if this function is anymore needed given we recently added a validation to
> fail delete request if one of the glusterd is down. I'll get back to you on
> this.
>
> On Mon, 21 Nov 2016 at 07:24, songxin  wrote:
>
> Hi Atin,
> Thank you for your support.
>
> And any conclusions about this issue?
>
> Thanks,
> Xin
>
>
>
>
>
> 在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:
>
> ok, thank you.
>
>
>
>
> 在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:
>
>
> Hi Atin,
>
> I think the root cause is in the function glusterd_import_friend_volume as
> below.
>
> int32_t
> glusterd_import_friend_volume (dict_t *peer_data, size_t count)
> {
> ...
> ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo);
> if (0 == ret) {
> (void) gd_check_and_update_rebalance_info (old_volinfo,
>new_volinfo);
> (void) glusterd_delete_stale_volume (old_volinfo,
> new_volinfo);
> }
> ...
> ret = glusterd_store_volinfo (new_volinfo,
> GLUSTERD_VOLINFO_VER_AC_NONE);
> if (ret) {
> gf_msg (this->name, GF_LOG_ERROR, 0,
> GD_MSG_VOLINFO_STORE_FAIL, "Failed to store "
> "volinfo for volume %s", new_volinfo->volname);
> goto out;
> }
> ...
> }
>
> glusterd_delete_stale_volume will remove the info and bricks/* and the
> glusterd_store_volinfo will create the new one.
> But if glusterd is killed before rename the info will is empty.
>
> And glusterd will start failed because the infois empty in the next time
> you start the glusterd.
>
> Any idea, Atin?
>
>
> Give me some time, will check it out, but reading at this analysis looks
> very well possible if a volume is changed when the glusterd was done on
> node a and when the same comes up during peer handshake we update the
> volinfo and during that time glusterd goes down once again. I'll confirm it
> by tomorrow.
>
>
> I checked the code and it does look like you have got the right RCA for
> the issue which you simulated through those two scripts. However this can
> happen even when you try to create a fresh volume and while glusterd tries
> to write the content into the store and goes down before renaming the
> info.tmp file you get into the same situation.
>
> I'd really need to think through if this can be fixed. Suggestions are
> always appreciated.
>
>
>
>
> BTW, excellent work Xin!
>
>
> Thanks,
> Xin
>
>
> 在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:
>
> Hi Atin,
> I have some clues about this issue.
> I could reproduce this issue use the scrip that mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .
>
>
> I really appreciate your help in trying to nail down this issue. While I
> am at your email and going through the code to figure out the possible
> cause for it, unfortunately I don't see any script in the attachment of the
> bug.  Could you please cross check?
>
>
>
> After I added some debug print,which like below, in glusterd-store.c and I
> found that the /var/lib/glusterd/vols/xxx/info and 
> /var/lib/glusterd/vols/xxx/bricks/*
> are removed.
> But other files in /var/lib/glusterd/vols/xxx/ will not be remove.
>
> int32_t
> glusterd_store_volinfo (glusterd_volinfo_t *volinfo,
> glusterd_volinfo_ver_ac_t ac)
> {
> int32_t ret = -1;
>
> GF_ASSERT (volinfo)
>
> ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
> if(ret < 0)
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not
> exit(%d)", errno);
> }
> else
> {
> ret = stat("/var/lib/glusterd/vols/gv0/info", );
> if(ret < 0)
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info
> error");
> }
>

Re: [Gluster-users] question about info and info.tmp

2016-11-24 Thread songxin




Hi Atin,
Actually, the glusterfs is used in my project.
And our test team find this issue.
So I want to make sure that whether you plan to fix it.
if you have plan I will wait you because your method shoud be better than mine.


Thanks,
Xin



在 2016-11-21 10:00:36,"Atin Mukherjee"  写道:

Hi Xin,

I've not got a chance to look into it yet. delete stale volume function is in 
place to take care of wiping off volume configuration data which has been 
deleted from the cluster. However we need to revisit this code to see if this 
function is anymore needed given we recently added a validation to fail delete 
request if one of the glusterd is down. I'll get back to you on this.


On Mon, 21 Nov 2016 at 07:24, songxin  wrote:

Hi Atin,
Thank you for your support.


And any conclusions about this issue?


Thanks,
Xin






在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:

ok, thank you.





在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:



Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Give me some time, will check it out, but reading at this analysis looks very 
well possible if a volume is changed when the glusterd was done on node a and 
when the same comes up during peer handshake we update the volinfo and during 
that time glusterd goes down once again. I'll confirm it by tomorrow.



I checked the code and it does look like you have got the right RCA for the 
issue which you simulated through those two scripts. However this can happen 
even when you try to create a fresh volume and while glusterd tries to write 
the content into the store and goes down before renaming the info.tmp file you 
get into the same situation.


I'd really need to think through if this can be fixed. Suggestions are always 
appreciated.

 



BTW, excellent work Xin!




Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 

Re: [Gluster-users] question about info and info.tmp

2016-11-20 Thread songxin
Hi Atin,
Ok.Thank you for your reply.


Thanks,
Xin






在 2016-11-21 10:00:36,"Atin Mukherjee"  写道:

Hi Xin,

I've not got a chance to look into it yet. delete stale volume function is in 
place to take care of wiping off volume configuration data which has been 
deleted from the cluster. However we need to revisit this code to see if this 
function is anymore needed given we recently added a validation to fail delete 
request if one of the glusterd is down. I'll get back to you on this.


On Mon, 21 Nov 2016 at 07:24, songxin  wrote:

Hi Atin,
Thank you for your support.


And any conclusions about this issue?


Thanks,
Xin






在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:

ok, thank you.





在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:



Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Give me some time, will check it out, but reading at this analysis looks very 
well possible if a volume is changed when the glusterd was done on node a and 
when the same comes up during peer handshake we update the volinfo and during 
that time glusterd goes down once again. I'll confirm it by tomorrow.



I checked the code and it does look like you have got the right RCA for the 
issue which you simulated through those two scripts. However this can happen 
even when you try to create a fresh volume and while glusterd tries to write 
the content into the store and goes down before renaming the info.tmp file you 
get into the same situation.


I'd really need to think through if this can be fixed. Suggestions are always 
appreciated.

 



BTW, excellent work Xin!




Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function 

Re: [Gluster-users] question about info and info.tmp

2016-11-20 Thread Atin Mukherjee
Hi Xin,

I've not got a chance to look into it yet. delete stale volume function is
in place to take care of wiping off volume configuration data which has
been deleted from the cluster. However we need to revisit this code to see
if this function is anymore needed given we recently added a validation to
fail delete request if one of the glusterd is down. I'll get back to you on
this.

On Mon, 21 Nov 2016 at 07:24, songxin  wrote:

> Hi Atin,
> Thank you for your support.
>
> And any conclusions about this issue?
>
> Thanks,
> Xin
>
>
>
>
>
> 在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:
>
> ok, thank you.
>
>
>
>
> 在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:
>
>
> Hi Atin,
>
> I think the root cause is in the function glusterd_import_friend_volume as
> below.
>
> int32_t
> glusterd_import_friend_volume (dict_t *peer_data, size_t count)
> {
> ...
> ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo);
> if (0 == ret) {
> (void) gd_check_and_update_rebalance_info (old_volinfo,
>new_volinfo);
> (void) glusterd_delete_stale_volume (old_volinfo,
> new_volinfo);
> }
> ...
> ret = glusterd_store_volinfo (new_volinfo,
> GLUSTERD_VOLINFO_VER_AC_NONE);
> if (ret) {
> gf_msg (this->name, GF_LOG_ERROR, 0,
> GD_MSG_VOLINFO_STORE_FAIL, "Failed to store "
> "volinfo for volume %s", new_volinfo->volname);
> goto out;
> }
> ...
> }
>
> glusterd_delete_stale_volume will remove the info and bricks/* and the
> glusterd_store_volinfo will create the new one.
> But if glusterd is killed before rename the info will is empty.
>
> And glusterd will start failed because the infois empty in the next time
> you start the glusterd.
>
> Any idea, Atin?
>
>
> Give me some time, will check it out, but reading at this analysis looks
> very well possible if a volume is changed when the glusterd was done on
> node a and when the same comes up during peer handshake we update the
> volinfo and during that time glusterd goes down once again. I'll confirm it
> by tomorrow.
>
>
> I checked the code and it does look like you have got the right RCA for
> the issue which you simulated through those two scripts. However this can
> happen even when you try to create a fresh volume and while glusterd tries
> to write the content into the store and goes down before renaming the
> info.tmp file you get into the same situation.
>
> I'd really need to think through if this can be fixed. Suggestions are
> always appreciated.
>
>
>
>
> BTW, excellent work Xin!
>
>
> Thanks,
> Xin
>
>
> 在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:
>
> Hi Atin,
> I have some clues about this issue.
> I could reproduce this issue use the scrip that mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .
>
>
> I really appreciate your help in trying to nail down this issue. While I
> am at your email and going through the code to figure out the possible
> cause for it, unfortunately I don't see any script in the attachment of the
> bug.  Could you please cross check?
>
>
>
> After I added some debug print,which like below, in glusterd-store.c and I
> found that the /var/lib/glusterd/vols/xxx/info and 
> /var/lib/glusterd/vols/xxx/bricks/*
> are removed.
> But other files in /var/lib/glusterd/vols/xxx/ will not be remove.
>
> int32_t
> glusterd_store_volinfo (glusterd_volinfo_t *volinfo,
> glusterd_volinfo_ver_ac_t ac)
> {
> int32_t ret = -1;
>
> GF_ASSERT (volinfo)
>
> ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
> if(ret < 0)
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not
> exit(%d)", errno);
> }
> else
> {
> ret = stat("/var/lib/glusterd/vols/gv0/info", );
> if(ret < 0)
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info
> error");
> }
> else
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size
> is %lu, inode num is %lu", buf.st_size, buf.st_ino);
> }
> }
>
> glusterd_perform_volinfo_version_action (volinfo, ac);
> ret = glusterd_store_create_volume_dir (volinfo);
> if (ret)
> goto out;
>
> ...
> }
>
> So it is easy to understand why  the info or
> 10.32.1.144.-opt-lvmdir-c2-brick sometimes is empty.
> It is becaue the info file is not exist, and it will be create by “fd =
> open (path, O_RDWR 

Re: [Gluster-users] question about info and info.tmp

2016-11-20 Thread songxin
Hi Atin,
Thank you for your support.


And any conclusions about this issue?


Thanks,
Xin






在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:

ok, thank you.





在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:



Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Give me some time, will check it out, but reading at this analysis looks very 
well possible if a volume is changed when the glusterd was done on node a and 
when the same comes up during peer handshake we update the volinfo and during 
that time glusterd goes down once again. I'll confirm it by tomorrow.



I checked the code and it does look like you have got the right RCA for the 
issue which you simulated through those two scripts. However this can happen 
even when you try to create a fresh volume and while glusterd tries to write 
the content into the store and goes down before renaming the info.tmp file you 
get into the same situation.


I'd really need to think through if this can be fixed. Suggestions are always 
appreciated.

 



BTW, excellent work Xin!




Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function gf_store_handle_new.
And the info file is empty before rename.
So the info file is empty if glusterd shutdown before rename.
 



My question is following.
1.I did not find the point the info is removed.Could you tell me the point 
where the info and /bricks/* are removed?
2.why the file info and bricks/* is removed?But other files in 
var/lib/glusterd/vols/xxx/ are not be removed?

AFAIK, we never delete the info file and hence this file is opened with 
O_APPEND flag. As I said I will go back and cross check the code once again.






Thanks,
Xin



在 2016-11-11 20:34:05,"Atin Mukherjee" 

Re: [Gluster-users] question about info and info.tmp

2016-11-16 Thread songxin


Hi Atin,
Thank you for your support.


I have a question for you.


glusterd_store_volinfo() will hidden remove the info and bricks/* by rename().
Why glusterd must remove the info and bricks/* in function 
glusterd_delete_stale_volume() before calling glusterd_store_volinfo()?


Thanks,
Xin





在 2016-11-16 20:59:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:

ok, thank you.





在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:



Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Give me some time, will check it out, but reading at this analysis looks very 
well possible if a volume is changed when the glusterd was done on node a and 
when the same comes up during peer handshake we update the volinfo and during 
that time glusterd goes down once again. I'll confirm it by tomorrow.



I checked the code and it does look like you have got the right RCA for the 
issue which you simulated through those two scripts. However this can happen 
even when you try to create a fresh volume and while glusterd tries to write 
the content into the store and goes down before renaming the info.tmp file you 
get into the same situation.


I'd really need to think through if this can be fixed. Suggestions are always 
appreciated.

 



BTW, excellent work Xin!




Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function gf_store_handle_new.
And the info file is empty before rename.
So the info file is empty if glusterd shutdown before rename.
 



My question is following.
1.I did not find the point the info is removed.Could you tell me the point 
where the info and /bricks/* are removed?
2.why the file info and bricks/* is removed?But other files in 
var/lib/glusterd/vols/xxx/ are not be removed?

AFAIK, we never 

Re: [Gluster-users] question about info and info.tmp

2016-11-16 Thread Atin Mukherjee
On Tue, Nov 15, 2016 at 1:53 PM, songxin  wrote:

> ok, thank you.
>
>
>
>
> 在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:
>
>>
>> Hi Atin,
>>
>> I think the root cause is in the function glusterd_import_friend_volume
>> as below.
>>
>> int32_t
>> glusterd_import_friend_volume (dict_t *peer_data, size_t count)
>> {
>> ...
>> ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo);
>> if (0 == ret) {
>> (void) gd_check_and_update_rebalance_info (old_volinfo,
>>new_volinfo);
>> (void) glusterd_delete_stale_volume (old_volinfo,
>> new_volinfo);
>> }
>> ...
>> ret = glusterd_store_volinfo (new_volinfo,
>> GLUSTERD_VOLINFO_VER_AC_NONE);
>> if (ret) {
>> gf_msg (this->name, GF_LOG_ERROR, 0,
>> GD_MSG_VOLINFO_STORE_FAIL, "Failed to store "
>> "volinfo for volume %s", new_volinfo->volname);
>> goto out;
>> }
>> ...
>> }
>>
>> glusterd_delete_stale_volume will remove the info and bricks/* and the
>> glusterd_store_volinfo will create the new one.
>> But if glusterd is killed before rename the info will is empty.
>>
>> And glusterd will start failed because the infois empty in the next time
>> you start the glusterd.
>>
>> Any idea, Atin?
>>
>
> Give me some time, will check it out, but reading at this analysis looks
> very well possible if a volume is changed when the glusterd was done on
> node a and when the same comes up during peer handshake we update the
> volinfo and during that time glusterd goes down once again. I'll confirm it
> by tomorrow.
>
>
I checked the code and it does look like you have got the right RCA for the
issue which you simulated through those two scripts. However this can
happen even when you try to create a fresh volume and while glusterd tries
to write the content into the store and goes down before renaming the
info.tmp file you get into the same situation.

I'd really need to think through if this can be fixed. Suggestions are
always appreciated.



>
> BTW, excellent work Xin!
>
>
>> Thanks,
>> Xin
>>
>>
>> 在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:
>>
>>> Hi Atin,
>>> I have some clues about this issue.
>>> I could reproduce this issue use the scrip that mentioned in
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .
>>>
>>
>> I really appreciate your help in trying to nail down this issue. While I
>> am at your email and going through the code to figure out the possible
>> cause for it, unfortunately I don't see any script in the attachment of the
>> bug.  Could you please cross check?
>>
>>
>>>
>>> After I added some debug print,which like below, in glusterd-store.c and
>>> I found that the /var/lib/glusterd/vols/xxx/info and
>>> /var/lib/glusterd/vols/xxx/bricks/* are removed.
>>> But other files in /var/lib/glusterd/vols/xxx/ will not be remove.
>>>
>>> int32_t
>>> glusterd_store_volinfo (glusterd_volinfo_t *volinfo,
>>> glusterd_volinfo_ver_ac_t ac)
>>> {
>>> int32_t ret = -1;
>>>
>>> GF_ASSERT (volinfo)
>>>
>>> ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
>>> if(ret < 0)
>>> {
>>> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not
>>> exit(%d)", errno);
>>> }
>>> else
>>> {
>>> ret = stat("/var/lib/glusterd/vols/gv0/info", );
>>> if(ret < 0)
>>> {
>>> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat
>>> info error");
>>> }
>>> else
>>> {
>>> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info
>>> size is %lu, inode num is %lu", buf.st_size, buf.st_ino);
>>> }
>>> }
>>>
>>> glusterd_perform_volinfo_version_action (volinfo, ac);
>>> ret = glusterd_store_create_volume_dir (volinfo);
>>> if (ret)
>>> goto out;
>>>
>>> ...
>>> }
>>>
>>> So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-
>>> brick sometimes is empty.
>>> It is becaue the info file is not exist, and it will be create by “fd =
>>> open (path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function
>>> gf_store_handle_new.
>>> And the info file is empty before rename.
>>> So the info file is empty if glusterd shutdown before rename.
>>>
>>>
>>
>>> My question is following.
>>> 1.I did not find the point the info is removed.Could you tell me the
>>> point where the info and /bricks/* are removed?
>>> 2.why the file info and bricks/* is removed?But other files in 
>>> var/lib/glusterd/vols/xxx/
>>> are not be removed?
>>>
>>
>> AFAIK, we never delete the 

Re: [Gluster-users] question about info and info.tmp

2016-11-15 Thread songxin
ok, thank you.




在 2016-11-15 16:12:34,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:



Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Give me some time, will check it out, but reading at this analysis looks very 
well possible if a volume is changed when the glusterd was done on node a and 
when the same comes up during peer handshake we update the volinfo and during 
that time glusterd goes down once again. I'll confirm it by tomorrow.


BTW, excellent work Xin!




Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function gf_store_handle_new.
And the info file is empty before rename.
So the info file is empty if glusterd shutdown before rename.
 



My question is following.
1.I did not find the point the info is removed.Could you tell me the point 
where the info and /bricks/* are removed?
2.why the file info and bricks/* is removed?But other files in 
var/lib/glusterd/vols/xxx/ are not be removed?

AFAIK, we never delete the info file and hence this file is opened with 
O_APPEND flag. As I said I will go back and cross check the code once again.






Thanks,
Xin



在 2016-11-11 20:34:05,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:

Hi Atin,



Thank you for your support.
Sincerely wait for your reply.


By the way, could you make sure that the issue, file info is empty, cause by 
rename is interrupted in kernel?


As per my RCA on that bug, it looked to be.
 



Thanks,
Xin

在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:

Hi Atin,
Thank you for your reply.
Actually it is very difficult to reproduce because I don't know when there was 
an ongoing commit happening.It is just a coincidence.
But I want to make sure the root 

Re: [Gluster-users] question about info and info.tmp

2016-11-15 Thread Atin Mukherjee
On Tue, Nov 15, 2016 at 12:47 PM, songxin  wrote:

>
> Hi Atin,
>
> I think the root cause is in the function glusterd_import_friend_volume as
> below.
>
> int32_t
> glusterd_import_friend_volume (dict_t *peer_data, size_t count)
> {
> ...
> ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo);
> if (0 == ret) {
> (void) gd_check_and_update_rebalance_info (old_volinfo,
>new_volinfo);
> (void) glusterd_delete_stale_volume (old_volinfo,
> new_volinfo);
> }
> ...
> ret = glusterd_store_volinfo (new_volinfo,
> GLUSTERD_VOLINFO_VER_AC_NONE);
> if (ret) {
> gf_msg (this->name, GF_LOG_ERROR, 0,
> GD_MSG_VOLINFO_STORE_FAIL, "Failed to store "
> "volinfo for volume %s", new_volinfo->volname);
> goto out;
> }
> ...
> }
>
> glusterd_delete_stale_volume will remove the info and bricks/* and the
> glusterd_store_volinfo will create the new one.
> But if glusterd is killed before rename the info will is empty.
>
> And glusterd will start failed because the infois empty in the next time
> you start the glusterd.
>
> Any idea, Atin?
>

Give me some time, will check it out, but reading at this analysis looks
very well possible if a volume is changed when the glusterd was done on
node a and when the same comes up during peer handshake we update the
volinfo and during that time glusterd goes down once again. I'll confirm it
by tomorrow.

BTW, excellent work Xin!


> Thanks,
> Xin
>
>
> 在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:
>
>
>
> On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:
>
>> Hi Atin,
>> I have some clues about this issue.
>> I could reproduce this issue use the scrip that mentioned in
>> https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .
>>
>
> I really appreciate your help in trying to nail down this issue. While I
> am at your email and going through the code to figure out the possible
> cause for it, unfortunately I don't see any script in the attachment of the
> bug.  Could you please cross check?
>
>
>>
>> After I added some debug print,which like below, in glusterd-store.c and
>> I found that the /var/lib/glusterd/vols/xxx/info and
>> /var/lib/glusterd/vols/xxx/bricks/* are removed.
>> But other files in /var/lib/glusterd/vols/xxx/ will not be remove.
>>
>> int32_t
>> glusterd_store_volinfo (glusterd_volinfo_t *volinfo,
>> glusterd_volinfo_ver_ac_t ac)
>> {
>> int32_t ret = -1;
>>
>> GF_ASSERT (volinfo)
>>
>> ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
>> if(ret < 0)
>> {
>> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not
>> exit(%d)", errno);
>> }
>> else
>> {
>> ret = stat("/var/lib/glusterd/vols/gv0/info", );
>> if(ret < 0)
>> {
>> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat
>> info error");
>> }
>> else
>> {
>> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info
>> size is %lu, inode num is %lu", buf.st_size, buf.st_ino);
>> }
>> }
>>
>> glusterd_perform_volinfo_version_action (volinfo, ac);
>> ret = glusterd_store_create_volume_dir (volinfo);
>> if (ret)
>> goto out;
>>
>> ...
>> }
>>
>> So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-
>> brick sometimes is empty.
>> It is becaue the info file is not exist, and it will be create by “fd =
>> open (path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function
>> gf_store_handle_new.
>> And the info file is empty before rename.
>> So the info file is empty if glusterd shutdown before rename.
>>
>>
>
>> My question is following.
>> 1.I did not find the point the info is removed.Could you tell me the
>> point where the info and /bricks/* are removed?
>> 2.why the file info and bricks/* is removed?But other files in 
>> var/lib/glusterd/vols/xxx/
>> are not be removed?
>>
>
> AFAIK, we never delete the info file and hence this file is opened with
> O_APPEND flag. As I said I will go back and cross check the code once again.
>
>
>
>
>> Thanks,
>> Xin
>>
>>
>> 在 2016-11-11 20:34:05,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:
>>
>>> Hi Atin,
>>>
>>> Thank you for your support.
>>> Sincerely wait for your reply.
>>>
>>> By the way, could you make sure that the issue, file info is empty,
>>> cause by rename is interrupted in kernel?
>>>
>>
>> As per my RCA on that bug, it looked to be.
>>
>>
>>>
>>> Thanks,
>>> Xin
>>>
>>> 在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:
>>>
>>>
>>>
>>> On Fri, Nov 11, 2016 at 1:15 PM, songxin 

Re: [Gluster-users] question about info and info.tmp

2016-11-14 Thread songxin


Hi Atin,


I think the root cause is in the function glusterd_import_friend_volume as 
below. 

int32_t 
glusterd_import_friend_volume (dict_t *peer_data, size_t count) 
{ 
... 
ret = glusterd_volinfo_find (new_volinfo->volname, _volinfo); 
if (0 == ret) { 
(void) gd_check_and_update_rebalance_info (old_volinfo, 
   new_volinfo); 
(void) glusterd_delete_stale_volume (old_volinfo, new_volinfo); 
} 
... 
ret = glusterd_store_volinfo (new_volinfo, 
GLUSTERD_VOLINFO_VER_AC_NONE); 
if (ret) { 
gf_msg (this->name, GF_LOG_ERROR, 0, 
GD_MSG_VOLINFO_STORE_FAIL, "Failed to store " 
"volinfo for volume %s", new_volinfo->volname); 
goto out; 
} 
... 
} 

glusterd_delete_stale_volume will remove the info and bricks/* and the 
glusterd_store_volinfo will create the new one. 
But if glusterd is killed before rename the info will is empty. 


And glusterd will start failed because the infois empty in the next time you 
start the glusterd.


Any idea, Atin?


Thanks,
Xin



在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function gf_store_handle_new.
And the info file is empty before rename.
So the info file is empty if glusterd shutdown before rename.
 



My question is following.
1.I did not find the point the info is removed.Could you tell me the point 
where the info and /bricks/* are removed?
2.why the file info and bricks/* is removed?But other files in 
var/lib/glusterd/vols/xxx/ are not be removed?

AFAIK, we never delete the info file and hence this file is opened with 
O_APPEND flag. As I said I will go back and cross check the code once again.






Thanks,
Xin



在 2016-11-11 20:34:05,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:

Hi Atin,



Thank you for your support.
Sincerely wait for your reply.


By the way, could you make sure that the issue, file info is empty, cause by 
rename is interrupted in kernel?


As per my RCA on that bug, it looked to be.
 



Thanks,
Xin

在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:

Hi Atin,
Thank you for your reply.
Actually it is very difficult to reproduce because I don't know when there was 
an ongoing commit happening.It is just a coincidence.
But I want to make sure the root cause.


I'll give it a another try and see if this situation can be 
simulated/reproduced and will keep you posted.
 



So I would be grateful if you could answer my questions below.


You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct 

Re: [Gluster-users] question about info and info.tmp

2016-11-14 Thread songxin
Hi Atin,
Now I have known that the info and bricks/* are removed by the function 
glusterd_delete_stale_volume().
But I have not known how to solve this issue.


Thanks,
Xin






在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function gf_store_handle_new.
And the info file is empty before rename.
So the info file is empty if glusterd shutdown before rename.
 



My question is following.
1.I did not find the point the info is removed.Could you tell me the point 
where the info and /bricks/* are removed?
2.why the file info and bricks/* is removed?But other files in 
var/lib/glusterd/vols/xxx/ are not be removed?

AFAIK, we never delete the info file and hence this file is opened with 
O_APPEND flag. As I said I will go back and cross check the code once again.






Thanks,
Xin



在 2016-11-11 20:34:05,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:

Hi Atin,



Thank you for your support.
Sincerely wait for your reply.


By the way, could you make sure that the issue, file info is empty, cause by 
rename is interrupted in kernel?


As per my RCA on that bug, it looked to be.
 



Thanks,
Xin

在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:

Hi Atin,
Thank you for your reply.
Actually it is very difficult to reproduce because I don't know when there was 
an ongoing commit happening.It is just a coincidence.
But I want to make sure the root cause.


I'll give it a another try and see if this situation can be 
simulated/reproduced and will keep you posted.
 



So I would be grateful if you could answer my questions below.


You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?
In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can be running at the same time because of the 
big lock.
Why there are two files are empty?


Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
running in two thread?
Thanks,
Xin




在 2016-11-11 15:27:03,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:



Hi Atin,
Thank you for your reply.


As you said that the info file can only be changed in the 
glusterd_store_volinfo() sequentially because of the big lock.


I have found the similar issue as below that you mentioned. 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487


Great, so this is what I was 

Re: [Gluster-users] question about info and info.tmp

2016-11-14 Thread songxin


Hi Atin,
I have two nodes, a node and b node, in which creating a replicate volume and 
then start the volume.


I will run the script as below on b node.
#!/bin/bash
i=1   
while(($i<100))
do 
   
gluster volume set gv0 nfs.disable on


sleep 2s


gluster volume set gv0 nfs.disable off  
 
i=$(($i+1))
done


And I  run the script as below on a node at the same time.


#!/bin/bash


i=1
while(($i<100))
do


systemctl stop glusterd


systemctl start glusterd


gluster volume info


i=$(($i+1))
done


The issue is very easy reproduced on a board.


Could you please tell me where is the point that info file is unlink?


Thanks,
Xin








在 2016-11-15 12:07:05,"Atin Mukherjee"  写道:





On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


I really appreciate your help in trying to nail down this issue. While I am at 
your email and going through the code to figure out the possible cause for it, 
unfortunately I don't see any script in the attachment of the bug.  Could you 
please cross check?
 



After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function gf_store_handle_new.
And the info file is empty before rename.
So the info file is empty if glusterd shutdown before rename.
 



My question is following.
1.I did not find the point the info is removed.Could you tell me the point 
where the info and /bricks/* are removed?
2.why the file info and bricks/* is removed?But other files in 
var/lib/glusterd/vols/xxx/ are not be removed?

AFAIK, we never delete the info file and hence this file is opened with 
O_APPEND flag. As I said I will go back and cross check the code once again.






Thanks,
Xin



在 2016-11-11 20:34:05,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:

Hi Atin,



Thank you for your support.
Sincerely wait for your reply.


By the way, could you make sure that the issue, file info is empty, cause by 
rename is interrupted in kernel?


As per my RCA on that bug, it looked to be.
 



Thanks,
Xin

在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:

Hi Atin,
Thank you for your reply.
Actually it is very difficult to reproduce because I don't know when there was 
an ongoing commit happening.It is just a coincidence.
But I want to make sure the root cause.


I'll give it a another try and see if this situation can be 
simulated/reproduced and will keep you posted.
 



So I would be grateful if you could answer my questions below.


You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?
In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can 

Re: [Gluster-users] question about info and info.tmp

2016-11-14 Thread Atin Mukherjee
On Tue, Nov 15, 2016 at 8:58 AM, songxin  wrote:

> Hi Atin,
> I have some clues about this issue.
> I could reproduce this issue use the scrip that mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .
>

I really appreciate your help in trying to nail down this issue. While I am
at your email and going through the code to figure out the possible cause
for it, unfortunately I don't see any script in the attachment of the bug.
Could you please cross check?


>
> After I added some debug print,which like below, in glusterd-store.c and I
> found that the /var/lib/glusterd/vols/xxx/info and 
> /var/lib/glusterd/vols/xxx/bricks/*
> are removed.
> But other files in /var/lib/glusterd/vols/xxx/ will not be remove.
>
> int32_t
> glusterd_store_volinfo (glusterd_volinfo_t *volinfo,
> glusterd_volinfo_ver_ac_t ac)
> {
> int32_t ret = -1;
>
> GF_ASSERT (volinfo)
>
> ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
> if(ret < 0)
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not
> exit(%d)", errno);
> }
> else
> {
> ret = stat("/var/lib/glusterd/vols/gv0/info", );
> if(ret < 0)
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info
> error");
> }
> else
> {
> gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size
> is %lu, inode num is %lu", buf.st_size, buf.st_ino);
> }
> }
>
> glusterd_perform_volinfo_version_action (volinfo, ac);
> ret = glusterd_store_create_volume_dir (volinfo);
> if (ret)
> goto out;
>
> ...
> }
>
> So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-
> brick sometimes is empty.
> It is becaue the info file is not exist, and it will be create by “fd =
> open (path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function
> gf_store_handle_new.
> And the info file is empty before rename.
> So the info file is empty if glusterd shutdown before rename.
>
>

> My question is following.
> 1.I did not find the point the info is removed.Could you tell me the point
> where the info and /bricks/* are removed?
> 2.why the file info and bricks/* is removed?But other files in 
> var/lib/glusterd/vols/xxx/
> are not be removed?
>

AFAIK, we never delete the info file and hence this file is opened with
O_APPEND flag. As I said I will go back and cross check the code once again.




> Thanks,
> Xin
>
>
> 在 2016-11-11 20:34:05,"Atin Mukherjee"  写道:
>
>
>
> On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:
>
>> Hi Atin,
>>
>> Thank you for your support.
>> Sincerely wait for your reply.
>>
>> By the way, could you make sure that the issue, file info is empty, cause
>> by rename is interrupted in kernel?
>>
>
> As per my RCA on that bug, it looked to be.
>
>
>>
>> Thanks,
>> Xin
>>
>> 在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:
>>
>>> Hi Atin,
>>> Thank you for your reply.
>>> Actually it is very difficult to reproduce because I don't know when there
>>> was an ongoing commit happening.It is just a coincidence.
>>> But I want to make sure the root cause.
>>>
>>
>> I'll give it a another try and see if this situation can be
>> simulated/reproduced and will keep you posted.
>>
>>
>>>
>>> So I would be grateful if you could answer my questions below.
>>>
>>> You said that "This issue is hit at part of the negative testing where
>>> while gluster volume set was executed at the same point of time glusterd in
>>> another instance was brought down. In the faulty node we could see
>>> /var/lib/glusterd/vols/info file been empty whereas the
>>> info.tmp file has the correct contents." in comment.
>>>
>>> I have two questions for you.
>>>
>>> 1.Could you reproduce this issue by gluster volume set glusterd which was 
>>> brought down?
>>> 2.Could you be certain that this issue is cause by rename is interrupted in 
>>> kernel?
>>>
>>> In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, 
>>> are both empty.
>>> But in my view only one rename can be running at the same time because of 
>>> the big lock.
>>> Why there are two files are empty?
>>>
>>>
>>> Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") 
>>> be running in two thread?
>>>
>>> Thanks,
>>> Xin
>>>
>>>
>>> 在 2016-11-11 15:27:03,"Atin Mukherjee"  写道:
>>>
>>>
>>>
>>> On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:
>>>

 Hi Atin,
 Thank you for your reply.

 As you said that the info file can only be changed in the 
 glusterd_store_volinfo()
 sequentially because of the big lock.

 I have found the similar issue as below that you 

Re: [Gluster-users] question about info and info.tmp

2016-11-14 Thread songxin
Hi Atin,
I have some clues about this issue.
I could reproduce this issue use the scrip that mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487 .


After I added some debug print,which like below, in glusterd-store.c and I 
found that the /var/lib/glusterd/vols/xxx/info and 
/var/lib/glusterd/vols/xxx/bricks/* are removed. 
But other files in /var/lib/glusterd/vols/xxx/ will not be remove.


int32_t
glusterd_store_volinfo (glusterd_volinfo_t *volinfo, glusterd_volinfo_ver_ac_t 
ac)
{
int32_t ret = -1;


GF_ASSERT (volinfo)


ret = access("/var/lib/glusterd/vols/gv0/info", F_OK);
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info is not exit(%d)", 
errno);
}
else
{
ret = stat("/var/lib/glusterd/vols/gv0/info", );
if(ret < 0)
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "stat info 
error");
}
else
{
gf_msg (THIS->name, GF_LOG_ERROR, 0, 0, "info size is 
%lu, inode num is %lu", buf.st_size, buf.st_ino);
}
}


glusterd_perform_volinfo_version_action (volinfo, ac);
ret = glusterd_store_create_volume_dir (volinfo);
if (ret)
goto out;


...
}


So it is easy to understand why  the info or 10.32.1.144.-opt-lvmdir-c2-brick 
sometimes is empty.
It is becaue the info file is not exist, and it will be create by “fd = open 
(path, O_RDWR | O_CREAT | O_APPEND, 0600);” in function gf_store_handle_new.
And the info file is empty before rename.
So the info file is empty if glusterd shutdown before rename.


My question is following.
1.I did not find the point the info is removed.Could you tell me the point 
where the info and /bricks/* are removed?
2.why the file info and bricks/* is removed?But other files in 
var/lib/glusterd/vols/xxx/ are not be removed?


Thanks,
Xin



在 2016-11-11 20:34:05,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:

Hi Atin,



Thank you for your support.
Sincerely wait for your reply.


By the way, could you make sure that the issue, file info is empty, cause by 
rename is interrupted in kernel?


As per my RCA on that bug, it looked to be.
 



Thanks,
Xin

在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:

Hi Atin,
Thank you for your reply.
Actually it is very difficult to reproduce because I don't know when there was 
an ongoing commit happening.It is just a coincidence.
But I want to make sure the root cause.


I'll give it a another try and see if this situation can be 
simulated/reproduced and will keep you posted.
 



So I would be grateful if you could answer my questions below.


You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?
In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can be running at the same time because of the 
big lock.
Why there are two files are empty?


Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
running in two thread?
Thanks,
Xin




在 2016-11-11 15:27:03,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:



Hi Atin,
Thank you for your reply.


As you said that the info file can only be changed in the 
glusterd_store_volinfo() sequentially because of the big lock.


I have found the similar issue as below that you mentioned. 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487


Great, so this is what I was actually trying to refer in my first email that I 
saw a similar issue. Have you got a chance to look at 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487#c4 ? But in your case, did 
you try to bring down glusterd when there was an ongoing commit happening?
 



You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this 

Re: [Gluster-users] question about info and info.tmp

2016-11-11 Thread Atin Mukherjee
On Fri, Nov 11, 2016 at 4:00 PM, songxin  wrote:

> Hi Atin,
>
> Thank you for your support.
> Sincerely wait for your reply.
>
> By the way, could you make sure that the issue, file info is empty, cause
> by rename is interrupted in kernel?
>

As per my RCA on that bug, it looked to be.


>
> Thanks,
> Xin
>
> 在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:
>
>
>
> On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:
>
>> Hi Atin,
>> Thank you for your reply.
>> Actually it is very difficult to reproduce because I don't know when there
>> was an ongoing commit happening.It is just a coincidence.
>> But I want to make sure the root cause.
>>
>
> I'll give it a another try and see if this situation can be
> simulated/reproduced and will keep you posted.
>
>
>>
>> So I would be grateful if you could answer my questions below.
>>
>> You said that "This issue is hit at part of the negative testing where
>> while gluster volume set was executed at the same point of time glusterd in
>> another instance was brought down. In the faulty node we could see
>> /var/lib/glusterd/vols/info file been empty whereas the
>> info.tmp file has the correct contents." in comment.
>>
>> I have two questions for you.
>>
>> 1.Could you reproduce this issue by gluster volume set glusterd which was 
>> brought down?
>> 2.Could you be certain that this issue is cause by rename is interrupted in 
>> kernel?
>>
>> In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, 
>> are both empty.
>> But in my view only one rename can be running at the same time because of 
>> the big lock.
>> Why there are two files are empty?
>>
>>
>> Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
>> running in two thread?
>>
>> Thanks,
>> Xin
>>
>>
>> 在 2016-11-11 15:27:03,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:
>>
>>>
>>> Hi Atin,
>>> Thank you for your reply.
>>>
>>> As you said that the info file can only be changed in the 
>>> glusterd_store_volinfo()
>>> sequentially because of the big lock.
>>>
>>> I have found the similar issue as below that you mentioned.
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1308487
>>>
>>
>> Great, so this is what I was actually trying to refer in my first email
>> that I saw a similar issue. Have you got a chance to look at
>> https://bugzilla.redhat.com/show_bug.cgi?id=1308487#c4 ? But in your
>> case, did you try to bring down glusterd when there was an ongoing commit
>> happening?
>>
>>
>>>
>>> You said that "This issue is hit at part of the negative testing where
>>> while gluster volume set was executed at the same point of time glusterd in
>>> another instance was brought down. In the faulty node we could see
>>> /var/lib/glusterd/vols/info file been empty whereas the
>>> info.tmp file has the correct contents." in comment.
>>>
>>> I have two questions for you.
>>>
>>> 1.Could you reproduce this issue by gluster volume set glusterd which was 
>>> brought down?
>>> 2.Could you be certain that this issue is cause by rename is interrupted in 
>>> kernel?
>>>
>>> In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, 
>>> are both empty.
>>> But in my view only one rename can be running at the same time because of 
>>> the big lock.
>>> Why there are two files are empty?
>>>
>>>
>>> Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") 
>>> be running in two thread?
>>>
>>> Thanks,
>>> Xin
>>>
>>>
>>>
>>>
>>> 在 2016-11-11 14:36:40,"Atin Mukherjee"  写道:
>>>
>>>
>>>
>>> On Fri, Nov 11, 2016 at 8:33 AM, songxin  wrote:
>>>
 Hi Atin,

 Thank you for your reply.
 I have two questions for you.

 1.Are the two files info and info.tmp are only to be created or changed
 in function glusterd_store_volinfo()? I did not find other point in which
 the two file are changed.

>>>
>>> If we are talking about info file volume then yes, the mentioned
>>> function actually takes care of it.
>>>
>>>
 2.I found that glusterd_store_volinfo() will be call in many point by
 glusterd.Is there a problem of thread synchronization?If so, one thread may
 open a same file info.tmp using O_TRUNC flag when another thread is
 writing the info,tmp.Could this case happen?

>>>
>>>  In glusterd threads are big lock protected and I don't see a
>>> possibility (theoretically) to have two glusterd_store_volinfo () calls at
>>> a given point of time.
>>>
>>>

 Thanks,
 Xin


 At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:

 Did you run out of disk space by any chance? AFAIK, the code is like we
 write new stuffs to .tmp file and rename it back to the original file. In
 case of a disk space issue I expect both the files to be of non zero size.
 But having said that 

Re: [Gluster-users] question about info and info.tmp

2016-11-11 Thread songxin
Hi Atin,



Thank you for your support.
Sincerely wait for your reply.


By the way, could you make sure that the issue, file info is empty, cause by 
rename is interrupted in kernel?


Thanks,
Xin

在 2016-11-11 15:49:02,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:

Hi Atin,
Thank you for your reply.
Actually it is very difficult to reproduce because I don't know when there was 
an ongoing commit happening.It is just a coincidence.
But I want to make sure the root cause.


I'll give it a another try and see if this situation can be 
simulated/reproduced and will keep you posted.
 



So I would be grateful if you could answer my questions below.


You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?
In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can be running at the same time because of the 
big lock.
Why there are two files are empty?


Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
running in two thread?
Thanks,
Xin




在 2016-11-11 15:27:03,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:



Hi Atin,
Thank you for your reply.


As you said that the info file can only be changed in the 
glusterd_store_volinfo() sequentially because of the big lock.


I have found the similar issue as below that you mentioned. 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487


Great, so this is what I was actually trying to refer in my first email that I 
saw a similar issue. Have you got a chance to look at 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487#c4 ? But in your case, did 
you try to bring down glusterd when there was an ongoing commit happening?
 



You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?
In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can be running at the same time because of the 
big lock.
Why there are two files are empty?


Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
running in two thread?

Thanks,
Xin




在 2016-11-11 14:36:40,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 8:33 AM, songxin  wrote:

Hi Atin,


Thank you for your reply.
I have two questions for you.


1.Are the two files info and info.tmp are only to be created or changed in 
function glusterd_store_volinfo()? I did not find other point in which the two 
file are changed.


If we are talking about info file volume then yes, the mentioned function 
actually takes care of it.
 

2.I found that glusterd_store_volinfo() will be call in many point by 
glusterd.Is there a problem of thread synchronization?If so, one thread may 
open a same file info.tmp using O_TRUNC flag when another thread is writing the 
info,tmp.Could this case happen?


 In glusterd threads are big lock protected and I don't see a possibility 
(theoretically) to have two glusterd_store_volinfo () calls at a given point of 
time.
 



Thanks,
Xin



At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:

Did you run out of disk space by any chance? AFAIK, the code is like we write 
new stuffs to .tmp file and rename it back to the original file. In case of a 
disk space issue I expect both the files to be of non zero size. But having 
said that I vaguely remember a similar issue (in the form of a bug or an email) 
landed up once but we couldn't reproduce it, so something is wrong with the 
atomic update here is what I guess. I'll be glad if you have a reproducer for 
the same and then we can dig into it further.



On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:

Hi,
When I start the glusterd some error happened.
And the log is following.

[2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main] 
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 

Re: [Gluster-users] question about info and info.tmp

2016-11-10 Thread Atin Mukherjee
On Fri, Nov 11, 2016 at 1:15 PM, songxin  wrote:

> Hi Atin,
> Thank you for your reply.
> Actually it is very difficult to reproduce because I don't know when there
> was an ongoing commit happening.It is just a coincidence.
> But I want to make sure the root cause.
>

I'll give it a another try and see if this situation can be
simulated/reproduced and will keep you posted.


>
> So I would be grateful if you could answer my questions below.
>
> You said that "This issue is hit at part of the negative testing where
> while gluster volume set was executed at the same point of time glusterd in
> another instance was brought down. In the faulty node we could see
> /var/lib/glusterd/vols/info file been empty whereas the info.tmp
> file has the correct contents." in comment.
>
> I have two questions for you.
>
> 1.Could you reproduce this issue by gluster volume set glusterd which was 
> brought down?
> 2.Could you be certain that this issue is cause by rename is interrupted in 
> kernel?
>
> In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, 
> are both empty.
> But in my view only one rename can be running at the same time because of the 
> big lock.
> Why there are two files are empty?
>
>
> Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
> running in two thread?
>
> Thanks,
> Xin
>
>
> 在 2016-11-11 15:27:03,"Atin Mukherjee"  写道:
>
>
>
> On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:
>
>>
>> Hi Atin,
>> Thank you for your reply.
>>
>> As you said that the info file can only be changed in the 
>> glusterd_store_volinfo()
>> sequentially because of the big lock.
>>
>> I have found the similar issue as below that you mentioned.
>> https://bugzilla.redhat.com/show_bug.cgi?id=1308487
>>
>
> Great, so this is what I was actually trying to refer in my first email
> that I saw a similar issue. Have you got a chance to look at
> https://bugzilla.redhat.com/show_bug.cgi?id=1308487#c4 ? But in your
> case, did you try to bring down glusterd when there was an ongoing commit
> happening?
>
>
>>
>> You said that "This issue is hit at part of the negative testing where
>> while gluster volume set was executed at the same point of time glusterd in
>> another instance was brought down. In the faulty node we could see
>> /var/lib/glusterd/vols/info file been empty whereas the
>> info.tmp file has the correct contents." in comment.
>>
>> I have two questions for you.
>>
>> 1.Could you reproduce this issue by gluster volume set glusterd which was 
>> brought down?
>> 2.Could you be certain that this issue is cause by rename is interrupted in 
>> kernel?
>>
>> In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, 
>> are both empty.
>> But in my view only one rename can be running at the same time because of 
>> the big lock.
>> Why there are two files are empty?
>>
>>
>> Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
>> running in two thread?
>>
>> Thanks,
>> Xin
>>
>>
>>
>>
>> 在 2016-11-11 14:36:40,"Atin Mukherjee"  写道:
>>
>>
>>
>> On Fri, Nov 11, 2016 at 8:33 AM, songxin  wrote:
>>
>>> Hi Atin,
>>>
>>> Thank you for your reply.
>>> I have two questions for you.
>>>
>>> 1.Are the two files info and info.tmp are only to be created or changed
>>> in function glusterd_store_volinfo()? I did not find other point in which
>>> the two file are changed.
>>>
>>
>> If we are talking about info file volume then yes, the mentioned function
>> actually takes care of it.
>>
>>
>>> 2.I found that glusterd_store_volinfo() will be call in many point by
>>> glusterd.Is there a problem of thread synchronization?If so, one thread may
>>> open a same file info.tmp using O_TRUNC flag when another thread is
>>> writing the info,tmp.Could this case happen?
>>>
>>
>>  In glusterd threads are big lock protected and I don't see a possibility
>> (theoretically) to have two glusterd_store_volinfo () calls at a given
>> point of time.
>>
>>
>>>
>>> Thanks,
>>> Xin
>>>
>>>
>>> At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:
>>>
>>> Did you run out of disk space by any chance? AFAIK, the code is like we
>>> write new stuffs to .tmp file and rename it back to the original file. In
>>> case of a disk space issue I expect both the files to be of non zero size.
>>> But having said that I vaguely remember a similar issue (in the form of a
>>> bug or an email) landed up once but we couldn't reproduce it, so something
>>> is wrong with the atomic update here is what I guess. I'll be glad if you
>>> have a reproducer for the same and then we can dig into it further.
>>>
>>> On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:
>>>
 Hi,
 When I start the glusterd some error happened.
 And the log is following.

 [2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main]
 

Re: [Gluster-users] question about info and info.tmp

2016-11-10 Thread songxin
Hi Atin,
Thank you for your reply.
Actually it is very difficult to reproduce because I don't know when there was 
an ongoing commit happening.It is just a coincidence.
But I want to make sure the root cause.


So I would be grateful if you could answer my questions below.


You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?
In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can be running at the same time because of the 
big lock.
Why there are two files are empty?


Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
running in two thread?
Thanks,
Xin



在 2016-11-11 15:27:03,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:



Hi Atin,
Thank you for your reply.


As you said that the info file can only be changed in the 
glusterd_store_volinfo() sequentially because of the big lock.


I have found the similar issue as below that you mentioned. 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487


Great, so this is what I was actually trying to refer in my first email that I 
saw a similar issue. Have you got a chance to look at 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487#c4 ? But in your case, did 
you try to bring down glusterd when there was an ongoing commit happening?
 



You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.
I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?
In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can be running at the same time because of the 
big lock.
Why there are two files are empty?


Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
running in two thread?

Thanks,
Xin




在 2016-11-11 14:36:40,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 8:33 AM, songxin  wrote:

Hi Atin,


Thank you for your reply.
I have two questions for you.


1.Are the two files info and info.tmp are only to be created or changed in 
function glusterd_store_volinfo()? I did not find other point in which the two 
file are changed.


If we are talking about info file volume then yes, the mentioned function 
actually takes care of it.
 

2.I found that glusterd_store_volinfo() will be call in many point by 
glusterd.Is there a problem of thread synchronization?If so, one thread may 
open a same file info.tmp using O_TRUNC flag when another thread is writing the 
info,tmp.Could this case happen?


 In glusterd threads are big lock protected and I don't see a possibility 
(theoretically) to have two glusterd_store_volinfo () calls at a given point of 
time.
 



Thanks,
Xin



At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:

Did you run out of disk space by any chance? AFAIK, the code is like we write 
new stuffs to .tmp file and rename it back to the original file. In case of a 
disk space issue I expect both the files to be of non zero size. But having 
said that I vaguely remember a similar issue (in the form of a bug or an email) 
landed up once but we couldn't reproduce it, so something is wrong with the 
atomic update here is what I guess. I'll be glad if you have a reproducer for 
the same and then we can dig into it further.



On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:

Hi,
When I start the glusterd some error happened.
And the log is following.

[2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main] 
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: 
/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) 
[2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init] 
0-management: Maximum allowed open file descriptors set to 65536 
[2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init] 
0-management: Using /system/glusterd as working directory
[2016-11-08 07:58:35.024508] I [MSGID: 106514] 
[glusterd-store.c:2075:glusterd_restore_op_version] 

Re: [Gluster-users] question about info and info.tmp

2016-11-10 Thread Atin Mukherjee
On Fri, Nov 11, 2016 at 12:38 PM, songxin  wrote:

>
> Hi Atin,
> Thank you for your reply.
>
> As you said that the info file can only be changed in the 
> glusterd_store_volinfo()
> sequentially because of the big lock.
>
> I have found the similar issue as below that you mentioned.
> https://bugzilla.redhat.com/show_bug.cgi?id=1308487
>

Great, so this is what I was actually trying to refer in my first email
that I saw a similar issue. Have you got a chance to look at
https://bugzilla.redhat.com/show_bug.cgi?id=1308487#c4 ? But in your case,
did you try to bring down glusterd when there was an ongoing commit
happening?


>
> You said that "This issue is hit at part of the negative testing where
> while gluster volume set was executed at the same point of time glusterd in
> another instance was brought down. In the faulty node we could see
> /var/lib/glusterd/vols/info file been empty whereas the info.tmp
> file has the correct contents." in comment.
>
> I have two questions for you.
>
> 1.Could you reproduce this issue by gluster volume set glusterd which was 
> brought down?
> 2.Could you be certain that this issue is cause by rename is interrupted in 
> kernel?
>
> In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, 
> are both empty.
> But in my view only one rename can be running at the same time because of the 
> big lock.
> Why there are two files are empty?
>
>
> Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
> running in two thread?
>
> Thanks,
> Xin
>
>
>
>
> 在 2016-11-11 14:36:40,"Atin Mukherjee"  写道:
>
>
>
> On Fri, Nov 11, 2016 at 8:33 AM, songxin  wrote:
>
>> Hi Atin,
>>
>> Thank you for your reply.
>> I have two questions for you.
>>
>> 1.Are the two files info and info.tmp are only to be created or changed
>> in function glusterd_store_volinfo()? I did not find other point in which
>> the two file are changed.
>>
>
> If we are talking about info file volume then yes, the mentioned function
> actually takes care of it.
>
>
>> 2.I found that glusterd_store_volinfo() will be call in many point by
>> glusterd.Is there a problem of thread synchronization?If so, one thread may
>> open a same file info.tmp using O_TRUNC flag when another thread is
>> writing the info,tmp.Could this case happen?
>>
>
>  In glusterd threads are big lock protected and I don't see a possibility
> (theoretically) to have two glusterd_store_volinfo () calls at a given
> point of time.
>
>
>>
>> Thanks,
>> Xin
>>
>>
>> At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:
>>
>> Did you run out of disk space by any chance? AFAIK, the code is like we
>> write new stuffs to .tmp file and rename it back to the original file. In
>> case of a disk space issue I expect both the files to be of non zero size.
>> But having said that I vaguely remember a similar issue (in the form of a
>> bug or an email) landed up once but we couldn't reproduce it, so something
>> is wrong with the atomic update here is what I guess. I'll be glad if you
>> have a reproducer for the same and then we can dig into it further.
>>
>> On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:
>>
>>> Hi,
>>> When I start the glusterd some error happened.
>>> And the log is following.
>>>
>>> [2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main]
>>> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
>>> (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
>>> [2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init]
>>> 0-management: Maximum allowed open file descriptors set to 65536
>>> [2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init]
>>> 0-management: Using /system/glusterd as working directory
>>> [2016-11-08 07:58:35.024508] I [MSGID: 106514]
>>> [glusterd-store.c:2075:glusterd_restore_op_version] 0-management:
>>> Upgrade detected. Setting op-version to minimum : 1
>>> *[2016-11-08 07:58:35.025356] E [MSGID: 106206]
>>> [glusterd-store.c:2562:glusterd_store_update_volinfo] 0-management: Failed
>>> to get next store iter *
>>> *[2016-11-08 07:58:35.025401] E [MSGID: 106207]
>>> [glusterd-store.c:2844:glusterd_store_retrieve_volume] 0-management: Failed
>>> to update volinfo for c_glusterfs volume *
>>> *[2016-11-08 07:58:35.025463] E [MSGID: 106201]
>>> [glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management:
>>> Unable to restore volume: c_glusterfs *
>>> *[2016-11-08 07:58:35.025544] E [MSGID: 101019]
>>> [xlator.c:428:xlator_init] 0-management: Initialization of volume
>>> 'management' failed, review your volfile again *
>>> *[2016-11-08 07:58:35.025582] E [graph.c:322:glusterfs_graph_init]
>>> 0-management: initializing translator failed *
>>> *[2016-11-08 07:58:35.025629] E [graph.c:661:glusterfs_graph_activate]
>>> 0-graph: init failed *
>>> [2016-11-08 07:58:35.026109] W 

Re: [Gluster-users] question about info and info.tmp

2016-11-10 Thread songxin


Hi Atin,
Thank you for your reply.


As you said that the info file can only be changed in the 
glusterd_store_volinfo() sequentially because of the big lock.


I have found the similar issue as below that you mentioned. 
https://bugzilla.redhat.com/show_bug.cgi?id=1308487


You said that "This issue is hit at part of the negative testing where while 
gluster volume set was executed at the same point of time glusterd in another 
instance was brought down. In the faulty node we could see 
/var/lib/glusterd/vols/info file been empty whereas the info.tmp file 
has the correct contents." in comment.

I have two questions for you.

1.Could you reproduce this issue by gluster volume set glusterd which was 
brought down?
2.Could you be certain that this issue is cause by rename is interrupted in 
kernel?

In my case there are two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are 
both empty.
But in my view only one rename can be running at the same time because of the 
big lock.
Why there are two files are empty?


Could rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") be 
running in two thread?

Thanks,
Xin




在 2016-11-11 14:36:40,"Atin Mukherjee"  写道:





On Fri, Nov 11, 2016 at 8:33 AM, songxin  wrote:

Hi Atin,


Thank you for your reply.
I have two questions for you.


1.Are the two files info and info.tmp are only to be created or changed in 
function glusterd_store_volinfo()? I did not find other point in which the two 
file are changed.


If we are talking about info file volume then yes, the mentioned function 
actually takes care of it.
 

2.I found that glusterd_store_volinfo() will be call in many point by 
glusterd.Is there a problem of thread synchronization?If so, one thread may 
open a same file info.tmp using O_TRUNC flag when another thread is writing the 
info,tmp.Could this case happen?


 In glusterd threads are big lock protected and I don't see a possibility 
(theoretically) to have two glusterd_store_volinfo () calls at a given point of 
time.
 



Thanks,
Xin



At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:

Did you run out of disk space by any chance? AFAIK, the code is like we write 
new stuffs to .tmp file and rename it back to the original file. In case of a 
disk space issue I expect both the files to be of non zero size. But having 
said that I vaguely remember a similar issue (in the form of a bug or an email) 
landed up once but we couldn't reproduce it, so something is wrong with the 
atomic update here is what I guess. I'll be glad if you have a reproducer for 
the same and then we can dig into it further.



On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:

Hi,
When I start the glusterd some error happened.
And the log is following.

[2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main] 
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: 
/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) 
[2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init] 
0-management: Maximum allowed open file descriptors set to 65536 
[2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init] 
0-management: Using /system/glusterd as working directory
[2016-11-08 07:58:35.024508] I [MSGID: 106514] 
[glusterd-store.c:2075:glusterd_restore_op_version] 0-management: Upgrade 
detected. Setting op-version to minimum : 1 
[2016-11-08 07:58:35.025356] E [MSGID: 106206] 
[glusterd-store.c:2562:glusterd_store_update_volinfo] 0-management: Failed to 
get next store iter 
[2016-11-08 07:58:35.025401] E [MSGID: 106207] 
[glusterd-store.c:2844:glusterd_store_retrieve_volume] 0-management: Failed to 
update volinfo for c_glusterfs volume 
[2016-11-08 07:58:35.025463] E [MSGID: 106201] 
[glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management: Unable to 
restore volume: c_glusterfs 
[2016-11-08 07:58:35.025544] E [MSGID: 101019] [xlator.c:428:xlator_init] 
0-management: Initialization of volume 'management' failed, review your volfile 
again 
[2016-11-08 07:58:35.025582] E [graph.c:322:glusterfs_graph_init] 0-management: 
initializing translator failed 
[2016-11-08 07:58:35.025629] E [graph.c:661:glusterfs_graph_activate] 0-graph: 
init failed 
[2016-11-08 07:58:35.026109] W [glusterfsd.c:1236:cleanup_and_exit] 
(-->/usr/sbin/glusterd(glusterfs_volumes_init-0x1b260) [0x1000a718] 
-->/usr/sbin/glusterd(glusterfs_process_volfp-0x1b3b8) [0x1000a5a8] 
-->/usr/sbin/glusterd(cleanup_and_exit-0x1c02c) [0x100098bc] ) 0-: received 
signum (0), shutting down 




And then I found that the size of vols/volume_name/info is 0.It cause glusterd 
shutdown.
But I found that vols/volume_name_info.tmp is not 0.
And I found that there is a brick file vols/volume_name/bricks/.brick is 0, 
but vols/volume_name/bricks/.brick.tmp is not 0.


I read the function code glusterd_store_volinfo () in glusterd-store.c .
I know that 

Re: [Gluster-users] question about info and info.tmp

2016-11-10 Thread Atin Mukherjee
On Fri, Nov 11, 2016 at 8:33 AM, songxin  wrote:

> Hi Atin,
>
> Thank you for your reply.
> I have two questions for you.
>
> 1.Are the two files info and info.tmp are only to be created or changed in
> function glusterd_store_volinfo()? I did not find other point in which the
> two file are changed.
>

If we are talking about info file volume then yes, the mentioned function
actually takes care of it.


> 2.I found that glusterd_store_volinfo() will be call in many point by
> glusterd.Is there a problem of thread synchronization?If so, one thread may
> open a same file info.tmp using O_TRUNC flag when another thread is
> writing the info,tmp.Could this case happen?
>

 In glusterd threads are big lock protected and I don't see a possibility
(theoretically) to have two glusterd_store_volinfo () calls at a given
point of time.


>
> Thanks,
> Xin
>
>
> At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:
>
> Did you run out of disk space by any chance? AFAIK, the code is like we
> write new stuffs to .tmp file and rename it back to the original file. In
> case of a disk space issue I expect both the files to be of non zero size.
> But having said that I vaguely remember a similar issue (in the form of a
> bug or an email) landed up once but we couldn't reproduce it, so something
> is wrong with the atomic update here is what I guess. I'll be glad if you
> have a reproducer for the same and then we can dig into it further.
>
> On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:
>
>> Hi,
>> When I start the glusterd some error happened.
>> And the log is following.
>>
>> [2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main]
>> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
>> (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
>> [2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init]
>> 0-management: Maximum allowed open file descriptors set to 65536
>> [2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init]
>> 0-management: Using /system/glusterd as working directory
>> [2016-11-08 07:58:35.024508] I [MSGID: 106514]
>> [glusterd-store.c:2075:glusterd_restore_op_version] 0-management:
>> Upgrade detected. Setting op-version to minimum : 1
>> *[2016-11-08 07:58:35.025356] E [MSGID: 106206]
>> [glusterd-store.c:2562:glusterd_store_update_volinfo] 0-management: Failed
>> to get next store iter *
>> *[2016-11-08 07:58:35.025401] E [MSGID: 106207]
>> [glusterd-store.c:2844:glusterd_store_retrieve_volume] 0-management: Failed
>> to update volinfo for c_glusterfs volume *
>> *[2016-11-08 07:58:35.025463] E [MSGID: 106201]
>> [glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management:
>> Unable to restore volume: c_glusterfs *
>> *[2016-11-08 07:58:35.025544] E [MSGID: 101019]
>> [xlator.c:428:xlator_init] 0-management: Initialization of volume
>> 'management' failed, review your volfile again *
>> *[2016-11-08 07:58:35.025582] E [graph.c:322:glusterfs_graph_init]
>> 0-management: initializing translator failed *
>> *[2016-11-08 07:58:35.025629] E [graph.c:661:glusterfs_graph_activate]
>> 0-graph: init failed *
>> [2016-11-08 07:58:35.026109] W [glusterfsd.c:1236:cleanup_and_exit]
>> (-->/usr/sbin/glusterd(glusterfs_volumes_init-0x1b260) [0x1000a718]
>> -->/usr/sbin/glusterd(glusterfs_process_volfp-0x1b3b8) [0x1000a5a8]
>> -->/usr/sbin/glusterd(cleanup_and_exit-0x1c02c) [0x100098bc] ) 0-:
>> received signum (0), shutting down
>>
>>
>> And then I found that the size of vols/volume_name/info is 0.It cause
>> glusterd shutdown.
>> But I found that vols/volume_name_info.tmp is not 0.
>> And I found that there is a brick file vols/volume_name/bricks/.brick
>> is 0, but vols/volume_name/bricks/.brick.tmp is not 0.
>>
>> I read the function code glusterd_store_volinfo () in glusterd-store.c .
>> I know that the info.tmp will be rename to info in function
>> glusterd_store_volume_atomic_update().
>>
>> But my question is that why the info file is 0 but info.tmp is not 0.
>>
>>
>> Thanks,
>> Xin
>>
>>
>>
>>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
>
>
> --
>
> ~ Atin (atinm)
>
>
>
>
>



-- 

~ Atin (atinm)
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] question about info and info.tmp

2016-11-10 Thread songxin
Hi Atin,


Thank you for your reply.
I have two questions for you.


1.Are the two files info and info.tmp are only to be created or changed in 
function glusterd_store_volinfo()? I did not find other point in which the two 
file are changed.
2.I found that glusterd_store_volinfo() will be call in many point by 
glusterd.Is there a problem of thread synchronization?If so, one thread may 
open a same file info.tmp using O_TRUNC flag when another thread is writing the 
info,tmp.Could this case happen?


Thanks,
Xin



At 2016-11-10 21:41:06, "Atin Mukherjee"  wrote:

Did you run out of disk space by any chance? AFAIK, the code is like we write 
new stuffs to .tmp file and rename it back to the original file. In case of a 
disk space issue I expect both the files to be of non zero size. But having 
said that I vaguely remember a similar issue (in the form of a bug or an email) 
landed up once but we couldn't reproduce it, so something is wrong with the 
atomic update here is what I guess. I'll be glad if you have a reproducer for 
the same and then we can dig into it further.



On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:

Hi,
When I start the glusterd some error happened.
And the log is following.

[2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main] 
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: 
/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) 
[2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init] 
0-management: Maximum allowed open file descriptors set to 65536 
[2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init] 
0-management: Using /system/glusterd as working directory
[2016-11-08 07:58:35.024508] I [MSGID: 106514] 
[glusterd-store.c:2075:glusterd_restore_op_version] 0-management: Upgrade 
detected. Setting op-version to minimum : 1 
[2016-11-08 07:58:35.025356] E [MSGID: 106206] 
[glusterd-store.c:2562:glusterd_store_update_volinfo] 0-management: Failed to 
get next store iter 
[2016-11-08 07:58:35.025401] E [MSGID: 106207] 
[glusterd-store.c:2844:glusterd_store_retrieve_volume] 0-management: Failed to 
update volinfo for c_glusterfs volume 
[2016-11-08 07:58:35.025463] E [MSGID: 106201] 
[glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management: Unable to 
restore volume: c_glusterfs 
[2016-11-08 07:58:35.025544] E [MSGID: 101019] [xlator.c:428:xlator_init] 
0-management: Initialization of volume 'management' failed, review your volfile 
again 
[2016-11-08 07:58:35.025582] E [graph.c:322:glusterfs_graph_init] 0-management: 
initializing translator failed 
[2016-11-08 07:58:35.025629] E [graph.c:661:glusterfs_graph_activate] 0-graph: 
init failed 
[2016-11-08 07:58:35.026109] W [glusterfsd.c:1236:cleanup_and_exit] 
(-->/usr/sbin/glusterd(glusterfs_volumes_init-0x1b260) [0x1000a718] 
-->/usr/sbin/glusterd(glusterfs_process_volfp-0x1b3b8) [0x1000a5a8] 
-->/usr/sbin/glusterd(cleanup_and_exit-0x1c02c) [0x100098bc] ) 0-: received 
signum (0), shutting down 




And then I found that the size of vols/volume_name/info is 0.It cause glusterd 
shutdown.
But I found that vols/volume_name_info.tmp is not 0.
And I found that there is a brick file vols/volume_name/bricks/.brick is 0, 
but vols/volume_name/bricks/.brick.tmp is not 0.


I read the function code glusterd_store_volinfo () in glusterd-store.c .
I know that the info.tmp will be rename to info in function 
glusterd_store_volume_atomic_update().


But my question is that why the info file is 0 but info.tmp is not 0.




Thanks,
Xin




 


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users




--



~ Atin (atinm)
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] question about info and info.tmp

2016-11-10 Thread Atin Mukherjee
Did you run out of disk space by any chance? AFAIK, the code is like we
write new stuffs to .tmp file and rename it back to the original file. In
case of a disk space issue I expect both the files to be of non zero size.
But having said that I vaguely remember a similar issue (in the form of a
bug or an email) landed up once but we couldn't reproduce it, so something
is wrong with the atomic update here is what I guess. I'll be glad if you
have a reproducer for the same and then we can dig into it further.

On Thu, Nov 10, 2016 at 1:32 PM, songxin  wrote:

> Hi,
> When I start the glusterd some error happened.
> And the log is following.
>
> [2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main]
> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
> (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
> [2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init]
> 0-management: Maximum allowed open file descriptors set to 65536
> [2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init]
> 0-management: Using /system/glusterd as working directory
> [2016-11-08 07:58:35.024508] I [MSGID: 106514] 
> [glusterd-store.c:2075:glusterd_restore_op_version]
> 0-management: Upgrade detected. Setting op-version to minimum : 1
> *[2016-11-08 07:58:35.025356] E [MSGID: 106206]
> [glusterd-store.c:2562:glusterd_store_update_volinfo] 0-management: Failed
> to get next store iter *
> *[2016-11-08 07:58:35.025401] E [MSGID: 106207]
> [glusterd-store.c:2844:glusterd_store_retrieve_volume] 0-management: Failed
> to update volinfo for c_glusterfs volume *
> *[2016-11-08 07:58:35.025463] E [MSGID: 106201]
> [glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management:
> Unable to restore volume: c_glusterfs *
> *[2016-11-08 07:58:35.025544] E [MSGID: 101019] [xlator.c:428:xlator_init]
> 0-management: Initialization of volume 'management' failed, review your
> volfile again *
> *[2016-11-08 07:58:35.025582] E [graph.c:322:glusterfs_graph_init]
> 0-management: initializing translator failed *
> *[2016-11-08 07:58:35.025629] E [graph.c:661:glusterfs_graph_activate]
> 0-graph: init failed *
> [2016-11-08 07:58:35.026109] W [glusterfsd.c:1236:cleanup_and_exit]
> (-->/usr/sbin/glusterd(glusterfs_volumes_init-0x1b260) [0x1000a718]
> -->/usr/sbin/glusterd(glusterfs_process_volfp-0x1b3b8) [0x1000a5a8]
> -->/usr/sbin/glusterd(cleanup_and_exit-0x1c02c) [0x100098bc] ) 0-:
> received signum (0), shutting down
>
>
> And then I found that the size of vols/volume_name/info is 0.It cause
> glusterd shutdown.
> But I found that vols/volume_name_info.tmp is not 0.
> And I found that there is a brick file vols/volume_name/bricks/.brick
> is 0, but vols/volume_name/bricks/.brick.tmp is not 0.
>
> I read the function code glusterd_store_volinfo () in glusterd-store.c .
> I know that the info.tmp will be rename to info in function
> glusterd_store_volume_atomic_update().
>
> But my question is that why the info file is 0 but info.tmp is not 0.
>
>
> Thanks,
> Xin
>
>
>
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>



-- 

~ Atin (atinm)
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] question about info and info.tmp

2016-11-10 Thread songxin
Hi,
When I start the glusterd some error happened.
And the log is following.

[2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main] 
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: 
/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) 
[2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init] 
0-management: Maximum allowed open file descriptors set to 65536 
[2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init] 
0-management: Using /system/glusterd as working directory
[2016-11-08 07:58:35.024508] I [MSGID: 106514] 
[glusterd-store.c:2075:glusterd_restore_op_version] 0-management: Upgrade 
detected. Setting op-version to minimum : 1 
[2016-11-08 07:58:35.025356] E [MSGID: 106206] 
[glusterd-store.c:2562:glusterd_store_update_volinfo] 0-management: Failed to 
get next store iter 
[2016-11-08 07:58:35.025401] E [MSGID: 106207] 
[glusterd-store.c:2844:glusterd_store_retrieve_volume] 0-management: Failed to 
update volinfo for c_glusterfs volume 
[2016-11-08 07:58:35.025463] E [MSGID: 106201] 
[glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management: Unable to 
restore volume: c_glusterfs 
[2016-11-08 07:58:35.025544] E [MSGID: 101019] [xlator.c:428:xlator_init] 
0-management: Initialization of volume 'management' failed, review your volfile 
again 
[2016-11-08 07:58:35.025582] E [graph.c:322:glusterfs_graph_init] 0-management: 
initializing translator failed 
[2016-11-08 07:58:35.025629] E [graph.c:661:glusterfs_graph_activate] 0-graph: 
init failed 
[2016-11-08 07:58:35.026109] W [glusterfsd.c:1236:cleanup_and_exit] 
(-->/usr/sbin/glusterd(glusterfs_volumes_init-0x1b260) [0x1000a718] 
-->/usr/sbin/glusterd(glusterfs_process_volfp-0x1b3b8) [0x1000a5a8] 
-->/usr/sbin/glusterd(cleanup_and_exit-0x1c02c) [0x100098bc] ) 0-: received 
signum (0), shutting down 




And then I found that the size of vols/volume_name/info is 0.It cause glusterd 
shutdown.
But I found that vols/volume_name_info.tmp is not 0.
And I found that there is a brick file vols/volume_name/bricks/.brick is 0, 
but vols/volume_name/bricks/.brick.tmp is not 0.


I read the function code glusterd_store_volinfo () in glusterd-store.c .
I know that the info.tmp will be rename to info in function 
glusterd_store_volume_atomic_update().


But my question is that why the info file is 0 but info.tmp is not 0.




Thanks,
Xin___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users