a problem about FileStore::_destroy_collection

2015-11-15 Thread yangruifeng.09...@h3c.com
An ENOTEMPTY error may happen when removing a pg in previous versions, but the
error is hidden in new versions.

_destroy_collection may return 0 even when get_index or prep_delete returns < 0,
because the local r inside the block shadows the outer r;

is this intended?

int FileStore::_destroy_collection(coll_t c)
{
  int r = 0; //global r
  char fn[PATH_MAX];
  get_cdir(c, fn, sizeof(fn));
  dout(15) << "_destroy_collection " << fn << dendl;
  {
    Index from;
    int r = get_index(c, &from); //local r
    if (r < 0)
      goto out;
    assert(NULL != from.index);
    RWLock::WLocker l((from.index)->access_lock);

    r = from->prep_delete();
    if (r < 0)
      goto out;
  }
  r = ::rmdir(fn);
  if (r < 0) {
    r = -errno;
    goto out;
  }

 out:
  // destroy parallel temp collection, too
  ...

 out_final:
  dout(10) << "_destroy_collection " << fn << " = " << r << dendl;
  return r;
}
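
For illustration only, here is a minimal self-contained program (not Ceph code)
that reproduces the pattern above: because the inner "int r" shadows the outer
one, the negative value set inside the block never reaches the final return, so
the caller sees 0.

// Minimal illustration of the shadowing pitfall; may_fail() is a stand-in
// for get_index()/prep_delete().
#include <iostream>

static int may_fail() { return -5; }

static int destroy_like_original() {
  int r = 0;                  // outer r, returned at the end
  {
    int r = may_fail();       // BUG: shadows the outer r
    if (r < 0)
      goto out;               // jumps out, but only the inner r holds -5
  }
 out:
  return r;                   // still 0: the error is hidden
}

static int destroy_fixed() {
  int r = 0;
  {
    r = may_fail();           // reuse the outer r, no shadowing
    if (r < 0)
      goto out;
  }
 out:
  return r;                   // -5 now propagates to the caller
}

int main() {
  std::cout << "shadowed:  " << destroy_like_original() << "\n";  // prints 0
  std::cout << "no shadow: " << destroy_fixed() << "\n";          // prints -5
}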


Re: Re: Re: Re: Re: Re: another peering stuck caused by net problem.

2015-11-02 Thread yangruifeng.09...@h3c.com
root@ceph:~# uname -a
Linux ceph 3.16.0-44-generic #59~14.04.1-Ubuntu SMP Tue Jul 7 15:07:27 UTC 2015 
x86_64 x86_64 x86_64 GNU/Linux
root@ceph:~# cat /etc/issue
Ubuntu 14.04.2 LTS \n \l

thanks
Ruifeng Yang.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] 
On Behalf Of Samuel Just
Sent: November 3, 2015 10:15
To: yangruifeng 09209 (RD)
Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com); 
ceph-devel@vger.kernel.org
Subject: Re: Re: Re: Re: Re: Re: another peering stuck caused by net problem.

Exactly what kernel are you using?
-Sam

On Mon, Nov 2, 2015 at 6:14 PM, Samuel Just  wrote:
> Yeah, there's a heartbeat system and the messenger is reliable delivery.
> -Sam
>
> On Mon, Nov 2, 2015 at 5:41 PM, yangruifeng.09...@h3c.com 
>  wrote:
>> I will try my best to get the detailed log.
>> In the current version, can we ensure that the messages related to 
>> peering are correctly received by peers?
>>
>> thanks
>> Ruifeng Yang.
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sj...@redhat.com]
>> Sent: November 3, 2015 9:28
>> To: yangruifeng 09209 (RD)
>> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com); 
>> ceph-devel@vger.kernel.org
>> Subject: Re: Re: Re: Re: Re: another peering stuck caused by net problem.
>>
>> Temporary network failures should be handled correctly.  The best solution 
>> is to actually fix that bug then.  Capture logging on all involved osds 
>> while it is hung and open a bug:
>>
>> debug osd = 20
>> debug filestore = 20
>> debug ms = 1
>> -Sam
>>
>> On Mon, Nov 2, 2015 at 5:24 PM, yangruifeng.09...@h3c.com 
>>  wrote:
>>> An unknown problem, which causes a pg to get stuck in peering, may be a 
>>> temporary network failure or another bug.
>>> BUT it can be solved by a *manual* 'ceph osd down '
>>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org 
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>>> Sent: November 3, 2015 9:12
>>> To: yangruifeng 09209 (RD)
>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com); 
>>> ceph-devel@vger.kernel.org
>>> Subject: Re: Re: Re: Re: another peering stuck caused by net problem.
>>>
>>> The problem is that peering shouldn't hang for no reason.  If you 
>>> are seeing peering hang for a long time either
>>> 1) you are hitting a peering bug which we need to track down and fix
>>> 2) peering actually cannot make progress.
>>>
>>> In case 1, it can be nice to have a workaround to force peering to restart 
>>> and avoid the bug.  However, case 2 would not be helped by restarting 
>>> peering, you'd just end up in the same place.  If you did it based on a 
>>> timeout, you'd just increase load by a ton when in that situation.  What 
>>> problem are you trying to solve?
>>> -Sam
>>>
>>> On Mon, Nov 2, 2015 at 5:05 PM, yangruifeng.09...@h3c.com 
>>>  wrote:
>>>> ok.
>>>>
>>>> thanks
>>>> Ruifeng Yang
>>>>
>>>> -----Original Message-----
>>>> From: Samuel Just [mailto:sj...@redhat.com]
>>>> Sent: November 3, 2015 9:03
>>>> To: yangruifeng 09209 (RD)
>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com)
>>>> Subject: Re: Re: Re: another peering stuck caused by net problem.
>>>>
>>>> Would it be ok if I reply to the list as well?
>>>> -Sam
>>>>
>>>> On Mon, Nov 2, 2015 at 4:37 PM, yangruifeng.09...@h3c.com 
>>>>  wrote:
>>>>> the cluster may stay stuck in peering in some exceptional cases, but 
>>>>> it can return to normal by a *manual* 'ceph osd down '; this 
>>>>> is not convenient in a production environment, and goes against the concept of 
>>>>> rados.
>>>>> would adding a timeout mechanism to kick it, or kicking it when io hangs, be 
>>>>> reasonable?
>>>>>
>>>>> thanks,
>>>>> Ruifeng Yang
>>>>>
>>>>> -----Original Message-----
>>>>> From: Samuel Just [mailto:sj...@redhat.com]
>>>>> Sent: November 3, 2015 2:21
>>>>> To: yangruifeng 09209 (RD)
>>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com)
>>>>> Subject: Re: Re: another peering stuck caused by net problem.
>>>>>
>>>>> I mean issue 'ceph osd down ' for the primary on the pg.  But that 
>>>>> only causes peering to restart.  If peering stalled previously, it'll 
>>>>> probably stall again.  What are you trying to accomplish?

Re: Re: Re: Re: Re: another peering stuck caused by net problem.

2015-11-02 Thread yangruifeng.09...@h3c.com
I will try my best to get the detailed log.
In the current version, can we ensure that the messages related to peering 
are correctly received by peers?

thanks
Ruifeng Yang.

-----Original Message-----
From: Samuel Just [mailto:sj...@redhat.com] 
Sent: November 3, 2015 9:28
To: yangruifeng 09209 (RD)
Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com); 
ceph-devel@vger.kernel.org
Subject: Re: Re: Re: Re: Re: another peering stuck caused by net problem.

Temporary network failures should be handled correctly.  The best solution is 
to actually fix that bug then.  Capture logging on all involved osds while it 
is hung and open a bug:

debug osd = 20
debug filestore = 20
debug ms = 1
-Sam

On Mon, Nov 2, 2015 at 5:24 PM, yangruifeng.09...@h3c.com 
 wrote:
> An unknown problem, which causes a pg to get stuck in peering, may be a temporary 
> network failure or another bug.
> BUT it can be solved by a *manual* 'ceph osd down '
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
> Sent: November 3, 2015 9:12
> To: yangruifeng 09209 (RD)
> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com); 
> ceph-devel@vger.kernel.org
> Subject: Re: Re: Re: Re: another peering stuck caused by net problem.
>
> The problem is that peering shouldn't hang for no reason.  If you are 
> seeing peering hang for a long time either
> 1) you are hitting a peering bug which we need to track down and fix
> 2) peering actually cannot make progress.
>
> In case 1, it can be nice to have a workaround to force peering to restart 
> and avoid the bug.  However, case 2 would not be helped by restarting 
> peering, you'd just end up in the same place.  If you did it based on a 
> timeout, you'd just increase load by a ton when in that situation.  What 
> problem are you trying to solve?
> -Sam
>
> On Mon, Nov 2, 2015 at 5:05 PM, yangruifeng.09...@h3c.com 
>  wrote:
>> ok.
>>
>> thanks
>> Ruifeng Yang
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sj...@redhat.com]
>> Sent: November 3, 2015 9:03
>> To: yangruifeng 09209 (RD)
>> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com)
>> Subject: Re: Re: Re: another peering stuck caused by net problem.
>>
>> Would it be ok if I reply to the list as well?
>> -Sam
>>
>> On Mon, Nov 2, 2015 at 4:37 PM, yangruifeng.09...@h3c.com 
>>  wrote:
>>> the cluster may stay stuck in peering in some exceptional cases, but 
>>> it can return to normal by a *manual* 'ceph osd down '; this is 
>>> not convenient in a production environment, and goes against the concept of 
>>> rados.
>>> would adding a timeout mechanism to kick it, or kicking it when io hangs, be 
>>> reasonable?
>>>
>>> thanks,
>>> Ruifeng Yang
>>>
>>> -----Original Message-----
>>> From: Samuel Just [mailto:sj...@redhat.com]
>>> Sent: November 3, 2015 2:21
>>> To: yangruifeng 09209 (RD)
>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com)
>>> Subject: Re: Re: another peering stuck caused by net problem.
>>>
>>> I mean issue 'ceph osd down ' for the primary on the pg.  But that 
>>> only causes peering to restart.  If peering stalled previously, it'll 
>>> probably stall again.  What are you trying to accomplish?
>>> -Sam
>>>
>>> On Fri, Oct 30, 2015 at 5:51 PM, yangruifeng.09...@h3c.com 
>>>  wrote:
>>>> do you mean restart primary osd? or any other command?
>>>>
>>>> thanks
>>>> Ruifeng Yang
>>>>
>>>> -----Original Message-----
>>>> From: Samuel Just [mailto:sj...@redhat.com]
>>>> Sent: October 30, 2015 23:07
>>>> To: chenxiaowei 11245 (RD)
>>>> Cc: Sage Weil (sw...@redhat.com); yangruifeng 09209 (RD)
>>>> Subject: Re: another peering stuck caused by net problem.
>>>>
>>>> How would that help?  As a way to work around a possible bug?  You can 
>>>> accomplish pretty much the same thing by setting the primary down.
>>>> -Sam
>>>>
>>>> On Wed, Oct 28, 2015 at 8:22 PM, Chenxiaowei  wrote:
>>>>> Hi, Samuel & Sage:
>>>>> I am cxwshawn from H3C (belonging to HP). The pg peering stuck 
>>>>> problem is a serious problem, especially in a production environment, 
>>>>> so here we came up with two solutions:
>>>>> if the Peering state is stuck too long, we can check whether a timeout 
>>>>> is exceeded to force a transition from Peering to the Reset state, or we can add a 
>>>>> command line to force one pg from Peering stuck to the Reset state.

Re: Re: Re: Re: another peering stuck caused by net problem.

2015-11-02 Thread yangruifeng.09...@h3c.com
An unknown problem, which causes a pg to get stuck in peering, may be a temporary 
network failure or another bug.
BUT it can be solved by a *manual* 'ceph osd down '

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] 
On Behalf Of Samuel Just
Sent: November 3, 2015 9:12
To: yangruifeng 09209 (RD)
Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com); 
ceph-devel@vger.kernel.org
Subject: Re: Re: Re: Re: another peering stuck caused by net problem.

The problem is that peering shouldn't hang for no reason.  If you are seeing 
peering hang for a long time either
1) you are hitting a peering bug which we need to track down and fix
2) peering actually cannot make progress.

In case 1, it can be nice to have a workaround to force peering to restart and 
avoid the bug.  However, case 2 would not be helped by restarting peering, 
you'd just end up in the same place.  If you did it based on a timeout, you'd 
just increase load by a ton when in that situation.  What problem are you 
trying to solve?
-Sam

On Mon, Nov 2, 2015 at 5:05 PM, yangruifeng.09...@h3c.com 
 wrote:
> ok.
>
> thanks
> Ruifeng Yang
>
> -----Original Message-----
> From: Samuel Just [mailto:sj...@redhat.com]
> Sent: November 3, 2015 9:03
> To: yangruifeng 09209 (RD)
> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com)
> Subject: Re: Re: Re: another peering stuck caused by net problem.
>
> Would it be ok if I reply to the list as well?
> -Sam
>
> On Mon, Nov 2, 2015 at 4:37 PM, yangruifeng.09...@h3c.com 
>  wrote:
>> the cluster may stay stuck in peering in some exceptional cases, but it 
>> can return to normal by a *manual* 'ceph osd down '; this is not 
>> convenient in a production environment, and goes against the concept of rados.
>> would adding a timeout mechanism to kick it, or kicking it when io hangs, be 
>> reasonable?
>>
>> thanks,
>> Ruifeng Yang
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sj...@redhat.com]
>> Sent: November 3, 2015 2:21
>> To: yangruifeng 09209 (RD)
>> Cc: chenxiaowei 11245 (RD); Sage Weil (sw...@redhat.com)
>> Subject: Re: Re: another peering stuck caused by net problem.
>>
>> I mean issue 'ceph osd down ' for the primary on the pg.  But that 
>> only causes peering to restart.  If peering stalled previously, it'll 
>> probably stall again.  What are you trying to accomplish?
>> -Sam
>>
>> On Fri, Oct 30, 2015 at 5:51 PM, yangruifeng.09...@h3c.com 
>>  wrote:
>>> do you mean restart primary osd? or any other command?
>>>
>>> thanks
>>> Ruifeng Yang
>>>
>>> -----Original Message-----
>>> From: Samuel Just [mailto:sj...@redhat.com]
>>> Sent: October 30, 2015 23:07
>>> To: chenxiaowei 11245 (RD)
>>> Cc: Sage Weil (sw...@redhat.com); yangruifeng 09209 (RD)
>>> Subject: Re: another peering stuck caused by net problem.
>>>
>>> How would that help?  As a way to work around a possible bug?  You can 
>>> accomplish pretty much the same thing by setting the primary down.
>>> -Sam
>>>
>>> On Wed, Oct 28, 2015 at 8:22 PM, Chenxiaowei  wrote:
>>>> Hi, Samuel & Sage:
>>>> I am cxwshawn from H3C (belonging to HP). The pg peering stuck 
>>>> problem is a serious problem, especially in a production environment, 
>>>> so here we came up with two solutions:
>>>> if the Peering state is stuck too long, we can check whether a timeout 
>>>> is exceeded to force a transition from Peering to the Reset state, or we can add a 
>>>> command line to force one pg from Peering stuck to the Reset state.
>>>>
>>>> What's your advice? Looking forward to your reply.
>>>>
>>>> Yours
>>>> shawn from Beijing, China.
>>>>
>>>> -------------------------------------------------------------------------------
>>>> This e-mail and its attachments contain confidential information 
>>>> from H3C, which is intended only for the person or entity whose 
>>>> address is listed above. Any use of the information contained 
>>>> herein in any way (including, but not limited to, total or partial 
>>>> disclosure, reproduction, or dissemination) by persons other than 
>>>> the intended
>>>> recipient(s) is prohibited. If you receive this e-mail in error, 
>>>> please notify the sender by phone or email immediately and delete it!

Re: osds full, delete stuck.

2015-09-15 Thread yangruifeng.09...@h3c.com
A simple solution:

In config_opts.h, add a configuration item osd_op_force_delete; should the 
default be false or true?

In the class OpRequest, add two functions: need_skip_full_check() and 
set_skip_full_check().

In OSD::init_op_flags, add a check: if there is a delete op and 
osd_op_force_delete is true, call set_skip_full_check().

In OSD::handle_op, when checking full, we can skip it via 
need_skip_full_check().
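
Roughly, as a sketch (simplified stand-ins only, not the actual Ceph
signatures; osd_op_force_delete, need_skip_full_check(), and
set_skip_full_check() are just the names proposed above):

// Sketch with simplified stand-ins, not real Ceph code.
struct Config {
  bool osd_op_force_delete = false;   // proposed config_opts.h option
};

class OpRequest {
  bool skip_full_check = false;       // proposed per-op flag
public:
  bool need_skip_full_check() const { return skip_full_check; }
  void set_skip_full_check()         { skip_full_check = true; }
};

// In OSD::init_op_flags(): mark delete ops when the option is enabled.
void init_op_flags(OpRequest& op, bool op_is_delete, const Config& conf) {
  if (op_is_delete && conf.osd_op_force_delete)
    op.set_skip_full_check();
}

// In OSD::handle_op(): let flagged ops through even when the OSD is full.
bool should_block_for_full(const OpRequest& op, bool osd_is_full) {
  return osd_is_full && !op.need_skip_full_check();
}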

Is that OK?

Ruifeng Yang

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] 
On Behalf Of Sage Weil
Sent: September 16, 2015 9:43
To: chenxiaowei 11245 (RD)
Cc: ceph-devel@vger.kernel.org
Subject: Re: osds full, delete stuck.

On Wed, 16 Sep 2015, Chenxiaowei wrote:
> 
> Hi, Sage:
> 
>  Lately my team ran into a problem: when osds are full, the 
> delete/write requests get stuck.
> 
> But we all agree that the cluster should not get stuck when sending delete 
> requests, because after that maybe the osds can
> 
> write again. So what's your advice? Looking forward to your reply, thanks.

The workaround is to temporarily raise the full threshold so you can do some 
deletes.  But you're right that a rados delete operation should be accepted in 
this state.  I've opened

http://tracker.ceph.com/issues/13110

sage


about the strtok in the process of creating swift acl policy

2015-07-28 Thread yangruifeng.09...@h3c.com
Hi,

Is strtok safe in this case? Why not strtok_r?

static int parse_list(string& uid_list, vector<string>& uids)
{
  ...
  const char *p = strtok(s, " ,");
  while (p) {
    ...
    p = strtok(NULL, " ,");
  }
  ...
}

bool RGWAccessControlPolicy_SWIFT::create(RGWRados *store, string& id, string& name,
                                          string& read_list, string& write_list)
{
  ...
  int r = parse_list(read_list, uids);
  ...
}
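
For comparison, a reentrant sketch of the parse_list loop above using strtok_r
(illustration only, not the actual rgw code, and parse_list_r is a hypothetical
name): strtok_r keeps its scan position in a caller-owned pointer instead of
the hidden static state strtok uses, so concurrent callers cannot interfere
with each other.

// Illustration only: a strtok_r-based variant of the loop above.
// strtok_r is POSIX; it stores its position in 'saveptr' rather than in
// static state, so two threads tokenizing different strings are safe.
#include <cstring>
#include <string>
#include <vector>

static int parse_list_r(const std::string& uid_list, std::vector<std::string>& uids)
{
  std::string buf = uid_list;          // strtok_r modifies the buffer, so copy it
  char* saveptr = nullptr;
  for (char* p = strtok_r(&buf[0], " ,", &saveptr);
       p != nullptr;
       p = strtok_r(nullptr, " ,", &saveptr)) {
    uids.push_back(p);                 // each token becomes one uid entry
  }
  return 0;
}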

Thanks
