Thanks Sage. I will create a "new feature" request on tracker.ceph.com <http://tracker.ceph.com/> so that this discussion does not get buried in the mailing list.
Developers can implement this at their convenience.

****************************************************************
Karan Singh
Systems Specialist, Storage Platforms
CSC - IT Center for Science,
Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758
tel. +358 9 4572001
fax +358 9 4572302
http://www.csc.fi/
****************************************************************

> On 10 Mar 2015, at 14:26, Sage Weil <s...@newdream.net> wrote:
>
> On Tue, 10 Mar 2015, Christian Eichelmann wrote:
>> Hi Sage,
>>
>> we hit this problem a few months ago as well, and it took us quite a
>> while to figure out what was wrong.
>>
>> As a system administrator I don't like the idea of daemons or even init
>> scripts changing system-wide configuration parameters, so I wouldn't
>> like to see the OSDs do it themselves.
>
> This is my general feeling as well. As we move to systemd, I'd like to
> have the ceph unit file get away from this entirely and have the admin set
> these values in /etc/security/limits.conf or /etc/sysctl.d. The main
> thing making this problematic right now is that the daemons run as root
> instead of a 'ceph' user.
>
>> The idea of the warning is on the one hand a good hint; on the other
>> hand it may also confuse people, since changing this setting is not
>> required for common hardware.
>
> If we make it warn only if it reaches > 50% of the threshold, that is
> probably safe...
>
> sage
>
>
>>
>> Regards,
>> Christian
>>
>> On 03/09/2015 08:01 PM, Sage Weil wrote:
>>> On Mon, 9 Mar 2015, Karan Singh wrote:
>>>> Thanks guys, kernel.pid_max=4194303 did the trick.
>>> Great to hear! Sorry we missed that you only had it at 65536.
>>>
>>> This is a really common problem that people hit when their clusters
>>> start to grow. Is there somewhere in the docs we can put this to catch
>>> more users? Or maybe a warning issued by the osds themselves or
>>> something if they see limits that are low?
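[Editor's note: Sage's proposed check above, warning only once usage passes 50% of the limit, can be sketched as a standalone shell snippet. This is not Ceph code, just the arithmetic of the check, using standard procps tools.]

```shell
#!/bin/sh
# Hypothetical sketch of the proposed health warning: warn only once the
# total number of tasks (processes plus their threads) exceeds 50% of
# kernel.pid_max. Not the OSD implementation, just the check itself.

pid_max=$(cat /proc/sys/kernel/pid_max)

# Sum the thread (light-weight process) count of every process.
threads=$(ps -eo nlwp= | awk '{ s += $1 } END { print s }')

if [ "$threads" -gt $((pid_max / 2)) ]; then
    echo "WARNING: $threads tasks in use, over 50% of kernel.pid_max ($pid_max)"
else
    echo "OK: $threads tasks in use (kernel.pid_max = $pid_max)"
fi
```

With the default pid_max of 32768 and the ~66,000 idle threads Christian reports below for a 60-OSD host, such a check would fire long before OSD restarts start failing.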
>>>
>>> sage
>>>
>>>> - Karan -
>>>>
>>>> On 09 Mar 2015, at 14:48, Christian Eichelmann
>>>> <christian.eichelm...@1und1.de> wrote:
>>>>
>>>> Hi Karan,
>>>>
>>>> as you are actually writing in your own book, the problem is the
>>>> sysctl setting "kernel.pid_max". I've seen in your bug report that you
>>>> were setting it to 65536, which is still too low for high-density
>>>> hardware.
>>>>
>>>> In our cluster, one OSD server has about 66,000 threads when idle
>>>> (60 OSDs per server). The number of threads increases when you
>>>> increase the number of placement groups in the cluster, which I think
>>>> is what triggered your problem.
>>>>
>>>> Set "kernel.pid_max" to 4194303 (the maximum), as Azad Aliyar
>>>> suggested, and the problem should be gone.
>>>>
>>>> Regards,
>>>> Christian
>>>>
>>>> On 09.03.2015 11:41, Karan Singh wrote:
>>>> Hello Community, I need help fixing a long-standing Ceph problem.
>>>>
>>>> The cluster is unhealthy and multiple OSDs are DOWN. When I try to
>>>> restart the OSDs, I get this error:
>>>>
>>>> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In
>>>> function 'void Thread::create(size_t)' thread 7f760dac9700 time
>>>> 2015-03-09 12:22:16.311970
>>>> common/Thread.cc: 129: FAILED assert(ret == 0)
>>>>
>>>> Environment: 4 nodes, OSD+Monitor, Firefly latest, CentOS 6.5,
>>>> kernel 3.17.2-1.el6.elrepo.x86_64
>>>>
>>>> Tried upgrading from 0.80.7 to 0.80.8, but no luck.
>>>>
>>>> Tried the CentOS stock kernel 2.6.32, but no luck.
>>>>
>>>> Memory is not a problem; more than 150 GB is free.
>>>>
>>>> Has anyone ever faced this problem?
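[Editor's note: for reference, Christian's fix can be applied and made persistent roughly as follows. The commands need root; since the affected cluster runs CentOS 6.5, this uses the classic /etc/sysctl.conf mechanism, and newer distributions would use a drop-in file under /etc/sysctl.d instead.]

```shell
# Apply immediately on the running system (requires root):
sysctl -w kernel.pid_max=4194303

# Persist across reboots (CentOS 6 style):
echo "kernel.pid_max = 4194303" >> /etc/sysctl.conf
sysctl -p

# Verify the running value:
cat /proc/sys/kernel/pid_max
```

4194303 (2^22 - 1) is the kernel's upper bound for pid_max on 64-bit systems, which is why Christian calls it "the maximum".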
>>>>
>>>> Cluster status:
>>>>
>>>>      cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
>>>>       health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs
>>>> incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck inactive;
>>>> 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080
>>>> objects degraded (19.501%); 111/196 in osds are down; clock skew
>>>> detected on mon.pouta-s02, mon.pouta-s03
>>>>       monmap e3: 3 mons at
>>>> {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0},
>>>> election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
>>>>       osdmap e26633: 239 osds: 85 up, 196 in
>>>>       pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
>>>>             4699 GB used, 707 TB / 711 TB avail
>>>>             6061/31080 objects degraded (19.501%)
>>>>                   14 down+remapped+peering
>>>>                   39 active
>>>>                 3289 active+clean
>>>>                  547 peering
>>>>                  663 stale+down+peering
>>>>                  705 stale+active+remapped
>>>>                    1 active+degraded+remapped
>>>>                    1 stale+down+incomplete
>>>>                  484 down+peering
>>>>                  455 active+remapped
>>>>                 3696 stale+active+degraded
>>>>                    4 remapped+peering
>>>>                   23 stale+down+remapped+peering
>>>>                   51 stale+active
>>>>                 3637 active+degraded
>>>>                 3799 stale+active+clean
>>>>
>>>> OSD logs:
>>>>
>>>> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In
>>>> function 'void Thread::create(size_t)' thread 7f760dac9700 time
>>>> 2015-03-09 12:22:16.311970
>>>> common/Thread.cc: 129: FAILED assert(ret == 0)
>>>>
>>>>  ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
>>>>  1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
>>>>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
>>>>  3: (Accepter::entry()+0x265) [0xb5c635]
>>>>  4: /lib64/libpthread.so.0() [0x3c8a6079d1]
>>>>  5: (clone()+0x6d) [0x3c8a2e89dd]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>> needed to interpret this.
>>>>
>>>> More information at Ceph tracker issue:
>>>> http://tracker.ceph.com/issues/10988#change-49018
>>>>
>>>> ****************************************************************
>>>> Karan Singh
>>>> Systems Specialist, Storage Platforms
>>>> CSC - IT Center for Science,
>>>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>>>> mobile: +358 503 812758
>>>> tel. +358 9 4572001
>>>> fax +358 9 4572302
>>>> http://www.csc.fi/
>>>> ****************************************************************
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> --
>>>> Christian Eichelmann
>>>> Systemadministrator
>>>>
>>>> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
>>>> Brauerstraße 48 · DE-76135 Karlsruhe
>>>> Telefon: +49 721 91374-8026
>>>> christian.eichelm...@1und1.de
>>>>
>>>> Amtsgericht Montabaur / HRB 6484
>>>> Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
>>>> Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan
>>>> Oetjen
>>>> Aufsichtsratsvorsitzender: Michael Scheeren
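[Editor's note: the backtrace above fails inside Thread::create, whose assert fires when pthread_create returns nonzero. When a host is close to kernel.pid_max, pthread_create fails with EAGAIN even though plenty of memory is free, which matches the symptoms in the thread. A quick diagnostic sketch (my own, not from the thread) to see how many threads the ceph-osd daemons on a host are using:]

```shell
#!/bin/sh
# Sum the "Threads:" field from /proc/<pid>/status for every ceph-osd
# process and print the total next to the current kernel.pid_max.
# Prints 0 for the thread total if no ceph-osd is running.

total=0
for pid in $(pgrep ceph-osd); do
    t=$(awk '/^Threads:/ { print $2 }' "/proc/$pid/status")
    total=$((total + t))
done

echo "ceph-osd threads: $total"
echo "kernel.pid_max:   $(cat /proc/sys/kernel/pid_max)"
```

On a dense host like Christian's (60 OSDs, ~66,000 idle threads), the first number sitting near the second is exactly the condition that makes OSD restarts fail with this assert.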
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com