Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

Azad Aliyar Mon, 09 Mar 2015 03:51:30 -0700

*Check Max Threadcount:* If you have a node with a lot of OSDs, you may be
hitting the default maximum number of threads (usually 32k), especially
during recovery. You can use sysctl to raise the limit to the maximum
allowed value (4194303) and see whether that helps. For example:


sysctl -w kernel.pid_max=4194303
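
Before raising the limit, it may be worth confirming that the node is really
close to it. The following is only a rough, read-only sketch: it assumes a
Linux /proc filesystem, and the helper name pct_used is made up for this
example. It compares the number of existing tasks (processes plus threads)
with kernel.pid_max.

```shell
#!/bin/sh
# Read-only check of how close a node is to the kernel PID/thread ceiling.
# Nothing here changes system state.

# pct_used THREADS PID_MAX: integer percentage of the ceiling consumed.
# (Hypothetical helper, named for this example only.)
pct_used() {
    echo $(( $1 * 100 / $2 ))
}

# Only attempt the check where the Linux /proc files are readable.
if [ -r /proc/sys/kernel/pid_max ] && [ -r /proc/loadavg ]; then
    pid_max=$(cat /proc/sys/kernel/pid_max)
    # Field 4 of /proc/loadavg is "runnable/total"; total counts every
    # scheduling entity (process or thread) that currently exists.
    read -r load1 load5 load15 entities last_pid < /proc/loadavg
    threads_in_use=${entities#*/}
    echo "kernel.pid_max : $pid_max"
    echo "threads in use : $threads_in_use"
    echo "percent used   : $(pct_used "$threads_in_use" "$pid_max")"
fi
```

If the percentage is already high while the OSDs are only partly up, the
thread ceiling is a plausible suspect for the failed Thread::create assert.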

If increasing the maximum thread count resolves the issue, you can make it
permanent by including a kernel.pid_max setting in the /etc/sysctl.conf
file. For example:

kernel.pid_max = 4194303
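
To check that the persisted value and the running value agree, something
along these lines might help. It is a sketch under assumptions: the function
name configured_pid_max is invented for this example, and it only looks at
/etc/sysctl.conf, not at any drop-in files a distribution may also read.

```shell
#!/bin/sh
# Compare the kernel.pid_max value persisted in a sysctl file with the
# value the running kernel is using. Read-only; changes nothing.

# configured_pid_max FILE: print the last uncommented kernel.pid_max value
# set in FILE, or nothing if the file does not set it.
# (Hypothetical helper, named for this example only.)
configured_pid_max() {
    sed -n 's/^[[:space:]]*kernel\.pid_max[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

conf=/etc/sysctl.conf
if [ -r "$conf" ] && [ -r /proc/sys/kernel/pid_max ]; then
    want=$(configured_pid_max "$conf")
    have=$(cat /proc/sys/kernel/pid_max)
    if [ -z "$want" ]; then
        echo "$conf does not set kernel.pid_max (running value: $have)"
    elif [ "$want" = "$have" ]; then
        echo "running value matches $conf ($have)"
    else
        # 'sysctl -p' run as root reloads /etc/sysctl.conf without a reboot.
        echo "running value $have differs from $conf ($want)"
    fi
fi
```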


On Mon, Mar 9, 2015 at 4:11 PM, Karan Singh <karan.si...@csc.fi> wrote:

> Hello Community, I need help fixing a long-running Ceph problem.
>
> The cluster is unhealthy and multiple OSDs are down. When I try to restart
> the OSDs, I get this error:
>
>
> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function
> 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
> common/Thread.cc: 129: FAILED assert(ret == 0)
>
>
> *Environment*: 4 nodes, OSD+Monitor, latest Firefly, CentOS 6.5,
> kernel 3.17.2-1.el6.elrepo.x86_64
>
> Tried upgrading from 0.80.7 to 0.80.8, but no luck.
>
> Tried the CentOS stock kernel 2.6.32, but no luck.
>
> Memory is not a problem; more than 150 GB is free.
>
>
> Has anyone ever faced this problem?
>
> *Cluster status*
>
>     cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
>      health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03
>      monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
>      osdmap e26633: 239 osds: 85 up, 196 in
>       pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
>             4699 GB used, 707 TB / 711 TB avail
>             6061/31080 objects degraded (19.501%)
>                   14 down+remapped+peering
>                   39 active
>                 3289 active+clean
>                  547 peering
>                  663 stale+down+peering
>                  705 stale+active+remapped
>                    1 active+degraded+remapped
>                    1 stale+down+incomplete
>                  484 down+peering
>                  455 active+remapped
>                 3696 stale+active+degraded
>                    4 remapped+peering
>                   23 stale+down+remapped+peering
>                   51 stale+active
>                 3637 active+degraded
>                 3799 stale+active+clean
>
> *OSD logs*
>
> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function
> 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
> common/Thread.cc: 129: FAILED assert(ret == 0)
>
>  ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
>  1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
>  3: (Accepter::entry()+0x265) [0xb5c635]
>  4: /lib64/libpthread.so.0() [0x3c8a6079d1]
>  5: (clone()+0x6d) [0x3c8a2e89dd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>  to interpret this.
>
>
> *More information in the Ceph tracker issue:*
> http://tracker.ceph.com/issues/10988#change-49018
>
>
> ****************************************************************
> Karan Singh
> Systems Specialist , Storage Platforms
> CSC - IT Center for Science,
> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
> mobile: +358 503 812758
> tel. +358 9 4572001
> fax +358 9 4572302
> http://www.csc.fi/
> ****************************************************************
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
   Warm Regards,  Azad Aliyar
 Linux Server Engineer
 *Email* :  azad.ali...@sparksupport.com   *|*   *Skype* :   spark.azad
<http://www.sparksupport.com> <http://www.sparkmycloud.com>
<https://www.facebook.com/sparksupport>
<http://www.linkedin.com/company/244846>  <https://twitter.com/sparksupport>
3rd Floor, Leela Infopark, Phase -2,Kakanad, Kochi-30, Kerala, India
*Phone*:+91 484 6561696 , *Mobile*:91-8129270421.

