Hi Christian,

Your problem is probably that your kernel.pid_max (the maximum number of threads+processes across the entire system) needs to be increased - the default is 32768, which is too low for even a medium-density deployment. You can test this easily enough with:

$ ps axms | wc -l

If you get a number around the 30,000 mark then you are going to be affected.
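
If it is, bumping kernel.pid_max is a sysctl away. A rough sketch (the new value below is only an example - pick whatever suits your density; the last two commands need root):

$ sysctl kernel.pid_max                                  # current limit, default 32768
$ sysctl -w kernel.pid_max=4194303                       # raise it at runtime
$ echo 'kernel.pid_max = 4194303' >> /etc/sysctl.conf    # persist it across reboots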

There's an issue tracking this at http://tracker.ceph.com/issues/6142, although it doesn't seem to have gotten much traction in terms of informing users.

Regards
Nathan

On 15/09/2014 7:13 PM, Christian Eichelmann wrote:
Hi all,

I have no idea why running out of file handles should produce an "out of
memory" error, but well. I've increased the ulimit as you told me, and
nothing changed. I've noticed that the osd init script sets the max open
file handles explicitly, so I set the corresponding option in my
ceph.conf. Now the limits of an OSD process look like this:

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2067478              2067478              processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2067478              2067478              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
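
For reference, a minimal sketch of that setting in ceph.conf (the init script reads "max open files" and applies it as the open-files ulimit for the daemons; the value here just mirrors the "Max open files" line above):

[global]
    # applied by the init script when starting the daemons
    max open files = 65536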

Anyways, the exact same behavior as before. I also found a mail on this
list from someone who had the exact same problem:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040059.html

Unfortunately, there was no real solution in that thread either.

So again: this is *NOT* a ulimit issue. We were running emperor and
dumpling on the same hardware without any issues. The problems first
started after our upgrade to firefly.

Regards,
Christian


Am 12.09.2014 18:26, schrieb Christian Balzer:
On Fri, 12 Sep 2014 12:05:06 -0400 Brian Rak wrote:

That's not how ulimit works.  Check the `ulimit -a` output.

Indeed.

And to forestall the next questions, see "man initscript", mine looks like
this:
---
ulimit -Hn 131072
ulimit -Sn 65536

# Execute the program.
eval exec "$4"
---

And also a /etc/security/limits.d/tuning.conf (debian) like this:
---
root            soft    nofile          65536
root            hard    nofile          131072
*               soft    nofile          16384
*               hard    nofile          65536
---

Adjust these to your actual needs. There might be other limits you're hitting,
but that is the most likely one.
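
To double check that the new limits actually take effect, something like this (pgrep -o just grabs one running ceph-osd PID as an example):
---
ulimit -Sn    # in a fresh login shell; should match the soft nofile above
ulimit -Hn    # should match the hard nofile above
grep 'open files' /proc/$(pgrep -o ceph-osd)/limits    # what a running OSD actually got
---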

Also 45 OSDs with 12 (24 with HT, bleah) CPU cores is pretty ballsy.
I personally would rather do 4 RAID6 (10 disks, with OSD SSD journals)
with that kind of case and enjoy the fact that my OSDs never fail. ^o^

Christian (another one)


On 9/12/2014 10:15 AM, Christian Eichelmann wrote:
Hi,

I am running all commands as root, so there are no limits for the
processes.

Regards,
Christian
_______________________________________
From: Mariusz Gronczewski [mariusz.gronczew...@efigence.com]
Sent: Friday, 12 September 2014 15:33
To: Christian Eichelmann
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSDs are crashing with "Cannot fork" or
"cannot create thread" but plenty of memory is left

Do a cat /proc/<pid>/limits - you probably hit the max processes limit
or the max FD limit.
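
For example (assuming a ceph-osd is still running; pgrep -o just picks one PID, adjust as needed):

PID=$(pgrep -o ceph-osd)
grep -E 'processes|open files' /proc/$PID/limits    # the two limits in question
ls /proc/$PID/task | wc -l                          # threads this OSD is using
ls /proc/$PID/fd | wc -l                            # file descriptors this OSD has open
ps -eLf | wc -l                                     # threads system-wide, to compare with kernel.pid_max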

Hi Ceph-Users,

I have absolutely no idea what is going on on my systems...

Hardware:
45 x 4TB Harddisks
2 x 6 Core CPUs
256GB Memory

When initializing all disks and joining them to the cluster, other OSDs
start crashing after approximately 30 OSDs are up. When I try to start
them again, I see different kinds of errors. For example:


Starting Ceph osd.316 on ceph-osd-bs04...already running
=== osd.317 ===
Traceback (most recent call last):
  File "/usr/bin/ceph", line 830, in <module>
    sys.exit(main())
  File "/usr/bin/ceph", line 773, in main
    sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 420, in new_style_command
    inbuf=inbuf)
  File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1112, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(cmd, e))
NameError: global name 'cmd' is not defined
Exception thread.error: error("can't start new thread",) in <bound method Rados.__del__ of <rados.Rados object at 0x29ee410>> ignored


or:
/etc/init.d/ceph: 190: /etc/init.d/ceph: Cannot fork
/etc/init.d/ceph: 191: /etc/init.d/ceph: Cannot fork
/etc/init.d/ceph: 192: /etc/init.d/ceph: Cannot fork

or:
/usr/bin/ceph-crush-location: 72: /usr/bin/ceph-crush-location: Cannot fork
/usr/bin/ceph-crush-location: 79: /usr/bin/ceph-crush-location: Cannot fork
Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fcf768c9760 time 2014-09-12 15:00:28.284735
common/Thread.cc: 110: FAILED assert(ret == 0)
 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
 1: /usr/bin/ceph-conf() [0x51de8f]
 2: (CephContext::CephContext(unsigned int)+0xb1) [0x520fe1]
 3: (common_preinit(CephInitParameters const&, code_environment_t, int)+0x48) [0x52eb78]
 4: (global_pre_init(std::vector<char const*, std::allocator<char const*> >*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int)+0x8d) [0x518d0d]
 5: (main()+0x17a) [0x514f6a]
 6: (__libc_start_main()+0xfd) [0x7fcf7522ceed]
 7: /usr/bin/ceph-conf() [0x5168d1]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted (core dumped)
/etc/init.d/ceph: 340: /etc/init.d/ceph: Cannot fork
/etc/init.d/ceph: 1: /etc/init.d/ceph: Cannot fork
Traceback (most recent call last):
  File "/usr/bin/ceph", line 830, in <module>
    sys.exit(main())
  File "/usr/bin/ceph", line 590, in main
    conffile=conffile)
  File "/usr/lib/python2.7/dist-packages/rados.py", line 198, in __init__
    librados_path = find_library('rados')
  File "/usr/lib/python2.7/ctypes/util.py", line 224, in find_library
    return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
  File "/usr/lib/python2.7/ctypes/util.py", line 213, in _findSoname_ldconfig
    f = os.popen('/sbin/ldconfig -p 2>/dev/null')
OSError: [Errno 12] Cannot allocate memory

But anyways, when I look at the memory consumption of the system:
# free -m
             total       used       free     shared    buffers     cached
Mem:        258450      25841     232609          0         18      15506
-/+ buffers/cache:       10315     248135
Swap:         3811          0       3811


There are more than 230GB of memory available! What is going on there?
System:
Linux ceph-osd-bs04 3.14-0.bpo.1-amd64 #1 SMP Debian 3.14.12-1~bpo70+1
(2014-07-13) x86_64 GNU/Linux

Since this is happening on other hardware as well, I don't think it's
hardware related. I have no idea if this is an OS issue (which would
be seriously strange) or a Ceph issue.

Since this is happening only AFTER we upgraded to firefly, I guess it
has something to do with Ceph.

ANY idea on what is going on here would be very appreciated!

Regards,
Christian

--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. WoĊ‚oska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
