Re: [ceph-users] Expanding pg's of an erasure coded pool
- Message from Guang Yang - Date: Fri, 30 May 2014 08:56:37 +0800 From: Guang Yang Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool To: Gregory Farnum Cc: Kenneth Waegeman , ceph-users On May 28, 2014, at 5:31 AM, Gregory Farnum wrote: On Sun, May 25, 2014 at 6:24 PM, Guang Yang wrote: On May 21, 2014, at 1:33 AM, Gregory Farnum wrote: This failure means the messenger subsystem is trying to create a thread and is getting an error code back — probably due to a process or system thread limit that you can turn up with ulimit. This is happening because a replicated PG primary needs a connection to only its replicas (generally 1 or 2 connections), but with an erasure-coded PG the primary requires a connection to m+n-1 replicas (everybody who's in the erasure-coding set, including itself). Right now our messenger requires a thread for each connection, so kerblam. (And it actually requires a couple such connections because we have separate heartbeat, cluster data, and client data systems.) Hi Greg, Is there any plan to refactor the messenger component to reduce the num of threads? For example, use event-driven mode. We've discussed it in very broad terms, but there are no concrete designs and it's not on the schedule yet. If anybody has conclusive evidence that it's causing them trouble they can't work around, that would be good to know… Well, we weren't able to find the source of the problem we had (see original message) but there was no more time to test it further. So something (memory or thread related?) was causing serious trouble on the large setup that didn't happen on our small setup, but at this moment I have no more information than what is included in the mails of this thread. Thanks for the response! We used to have a cluster where each OSD host had 11 disks (daemons), and on each host there were around 15K threads; the system was stable, but during cluster-wide changes (e.g. OSD down/out, recovery) we observed the system load increasing, though with no cascading failure. Most recently we have been evaluating Ceph on high-density hardware with 33 disks (daemons) per OSD host, giving around 40K-50K threads per host; with some OSD hosts down/out we started seeing the load climb sharply together with a large volume of thread creation and joining. We don't have strong evidence that the messenger thread model is the problem, or that an event-driven approach would help, but as we move to high-density hardware (for cost savings) the issue could be amplified. If there is any plan, it would be good to know, and we are very interested in getting involved. Thanks, Guang -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com - End message from Guang Yang - -- Kind regards, Kenneth Waegeman ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Expanding pg's of an erasure coded pool
On May 28, 2014, at 5:31 AM, Gregory Farnum wrote: > On Sun, May 25, 2014 at 6:24 PM, Guang Yang wrote: >> On May 21, 2014, at 1:33 AM, Gregory Farnum wrote: >> >>> This failure means the messenger subsystem is trying to create a >>> thread and is getting an error code back — probably due to a process >>> or system thread limit that you can turn up with ulimit. >>> >>> This is happening because a replicated PG primary needs a connection >>> to only its replicas (generally 1 or 2 connections), but with an >>> erasure-coded PG the primary requires a connection to m+n-1 replicas >>> (everybody who's in the erasure-coding set, including itself). Right >>> now our messenger requires a thread for each connection, so kerblam. >>> (And it actually requires a couple such connections because we have >>> separate heartbeat, cluster data, and client data systems.) >> Hi Greg, >> Is there any plan to refactor the messenger component to reduce the num of >> threads? For example, use event-driven mode. > > We've discussed it in very broad terms, but there are no concrete > designs and it's not on the schedule yet. If anybody has conclusive > evidence that it's causing them trouble they can't work around, that > would be good to know… Thanks for the response! We used to have a cluster where each OSD host had 11 disks (daemons), and on each host there were around 15K threads; the system was stable, but during cluster-wide changes (e.g. OSD down/out, recovery) we observed the system load increasing, though with no cascading failure. Most recently we have been evaluating Ceph on high-density hardware with 33 disks (daemons) per OSD host, giving around 40K-50K threads per host; with some OSD hosts down/out we started seeing the load climb sharply together with a large volume of thread creation and joining. We don't have strong evidence that the messenger thread model is the problem, or that an event-driven approach would help, but as we move to high-density hardware (for cost savings) the issue could be amplified. If there is any plan, it would be good to know, and we are very interested in getting involved. Thanks, Guang > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
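A back-of-the-envelope sketch of how per-host thread counts in this range can arise under a thread-per-connection messenger; the constants (peers per daemon, channels per OSD, threads per connection, baseline threads) are illustrative assumptions, not measurements from these clusters:

THREADS_PER_CONNECTION = 2   # assumption: one reader and one writer thread per connection
MESSENGERS_PER_OSD = 3       # assumption: separate client, cluster and heartbeat channels
BASE_THREADS_PER_OSD = 30    # rough guess for work queues, timers, etc.

def host_threads(osds_per_host, peers_per_osd):
    """Rough thread count for a host running osds_per_host OSD daemons,
    each holding connections to peers_per_osd other OSDs."""
    per_osd = BASE_THREADS_PER_OSD + peers_per_osd * MESSENGERS_PER_OSD * THREADS_PER_CONNECTION
    return osds_per_host * per_osd

for osds in (11, 33):
    print(osds, "OSDs/host ->", host_threads(osds, peers_per_osd=200), "threads")

With roughly 200 peers per daemon this lands near the 15K and 40K-50K figures reported above; that is only a sanity check of the model, not a claim about the actual clusters.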
Re: [ceph-users] Expanding pg's of an erasure coded pool
On Sun, May 25, 2014 at 6:24 PM, Guang Yang wrote: > On May 21, 2014, at 1:33 AM, Gregory Farnum wrote: > >> This failure means the messenger subsystem is trying to create a >> thread and is getting an error code back — probably due to a process >> or system thread limit that you can turn up with ulimit. >> >> This is happening because a replicated PG primary needs a connection >> to only its replicas (generally 1 or 2 connections), but with an >> erasure-coded PG the primary requires a connection to m+n-1 replicas >> (everybody who's in the erasure-coding set, including itself). Right >> now our messenger requires a thread for each connection, so kerblam. >> (And it actually requires a couple such connections because we have >> separate heartbeat, cluster data, and client data systems.) > Hi Greg, > Is there any plan to refactor the messenger component to reduce the num of > threads? For example, use event-driven mode. We've discussed it in very broad terms, but there are no concrete designs and it's not on the schedule yet. If anybody has conclusive evidence that it's causing them trouble they can't work around, that would be good to know... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
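For what "event-driven mode" means in practice, a minimal sketch in standard-library Python (explicitly not Ceph code): one thread multiplexes all connections through the kernel's readiness notification instead of dedicating a reader/writer thread pair to every socket.

import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)       # toy echo; a real messenger would dispatch the message
    else:
        sel.unregister(conn)     # peer closed the connection
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(128)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:                      # one thread, however many connections are open
    for key, _ in sel.select():
        key.data(key.fileobj)

The thread count stays constant no matter how many peers connect, which is the property that would sidestep the ulimit failures discussed in this thread.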
Re: [ceph-users] Expanding pg's of an erasure coded pool
On May 21, 2014, at 1:33 AM, Gregory Farnum wrote: > This failure means the messenger subsystem is trying to create a > thread and is getting an error code back — probably due to a process > or system thread limit that you can turn up with ulimit. > > This is happening because a replicated PG primary needs a connection > to only its replicas (generally 1 or 2 connections), but with an > erasure-coded PG the primary requires a connection to m+n-1 replicas > (everybody who's in the erasure-coding set, including itself). Right > now our messenger requires a thread for each connection, so kerblam. > (And it actually requires a couple such connections because we have > separate heartbeat, cluster data, and client data systems.) Hi Greg, Is there any plan to refactor the messenger component to reduce the num of threads? For example, use event-driven mode. Thanks, Guang > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > > On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman > wrote: >> Hi, >> >> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to >> create a erasure coded pool with 4096 pgs, but this crashed the cluster. >> I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I >> then try to expand to 4096 (not even quite enough) the cluster crashes >> again. ( Do we need less of pg's with erasure coding?) >> >> The crash starts with individual OSDs crashing, eventually bringing down the >> mons (until there is no more quorum or too few osds) >> >> Out of the logs: >> >> >> -16> 2014-05-20 10:31:55.545590 7fd42f34d700 5 -- op tracker -- , seq: >> 14301, time: 2014-05-20 10:31:55.545590, event: started, request: >> pg_query(0.974 epoch 3315) v3 >> -15> 2014-05-20 10:31:55.545776 7fd42f34d700 1 -- >> 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 >> epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9 >> 026b40 >> -14> 2014-05-20 10:31:55.545807 7fd42f34d700 5 -- op tracker -- , seq: >> 14301, time: 2014-05-20 10:31:55.545807, event: done, request: >> pg_query(0.974 epoch 3315) v3 >> -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700 1 -- >> 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 >> l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0 >> -12> 2014-05-20 10:31:55.564034 7fd3bf72f700 1 -- >> 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 >> l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0 >> -11> 2014-05-20 10:31:55.627776 7fd42df4b700 1 -- >> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 >> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 47+0+0 >> (855262282 0 0) 0xb6863c0 con 0x1255b9c0 >> -10> 2014-05-20 10:31:55.629425 7fd42df4b700 1 -- >> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 >> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 47+0+0 >> (2581193378 0 0) 0x93d6c80 con 0x1255b9c0 >>-9> 2014-05-20 10:31:55.631270 7fd42f34d700 1 -- >> 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 >> pg_query(7.3ffs6 epoch 3326) v3 144+0+0 (221596234 0 0) 0x10b994a0 con >> 0x9383860 >>-8> 2014-05-20 10:31:55.631308 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-7> 2014-05-20 10:31:55.631315 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-6> 2014-05-20 10:31:55.631339 
7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-5> 2014-05-20 10:31:55.631343 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-4> 2014-05-20 10:31:55.631349 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-3> 2014-05-20 10:31:55.631363 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631363, event: started, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-2> 2014-05-20 10:31:55.631402 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631402, event: done, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-1> 2014-05-20 10:31:55.631488 7fd427b41700 1 -- >> 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 -- >> pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860 >> 0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In >> function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20 >> 10:31:55.630937 >> common/Thread.cc: 110: FAILED assert(ret == 0) >> >> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) >>
Re: [ceph-users] Expanding pg's of an erasure coded pool
On 2014.05.22 19:55, Gregory Farnum wrote: > On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman > wrote: >> - Message from Gregory Farnum - >>Date: Wed, 21 May 2014 15:46:17 -0700 >> >>From: Gregory Farnum >> Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool >> To: Kenneth Waegeman >> Cc: ceph-users >> >> >>> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman >>> wrote: >>>> Thanks! I increased the max processes parameter for all daemons quite a >>>> lot >>>> (until ulimit -u 3802720) >>>> >>>> These are the limits for the daemons now.. >>>> [root@ ~]# cat /proc/17006/limits >>>> Limit Soft Limit Hard Limit Units >>>> Max cpu time unlimitedunlimited >>>> seconds >>>> Max file size unlimitedunlimitedbytes >>>> Max data size unlimitedunlimitedbytes >>>> Max stack size10485760 unlimitedbytes >>>> Max core file sizeunlimitedunlimitedbytes >>>> Max resident set unlimitedunlimitedbytes >>>> Max processes 3802720 3802720 >>>> processes >>>> Max open files3276832768files >>>> Max locked memory 6553665536bytes >>>> Max address space unlimitedunlimitedbytes >>>> Max file locksunlimitedunlimitedlocks >>>> Max pending signals 9506895068 >>>> signals >>>> Max msgqueue size 819200 819200 bytes >>>> Max nice priority 00 >>>> Max realtime priority 00 >>>> Max realtime timeout unlimitedunlimitedus >>>> >>>> But this didn't help. Are there other parameters I should change? >>> >>> Hrm, is it exactly the same stack trace? You might need to bump the >>> open files limit as well, although I'd be surprised. :/ >> >> I increased the open file limit as test to 128000, still the same results. >> >> Stack trace: > > >> But I see some things happening on the system while doing this too: >> >> >> >> [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096 >> set pool 16 pgp_num to 4096 >> [root@ ~]# ceph status >> Traceback (most recent call last): >> File "/usr/bin/ceph", line 830, in >> sys.exit(main()) >> File "/usr/bin/ceph", line 590, in main >> conffile=conffile) >> File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__ >> librados_path = find_library('rados') >> File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library >> return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name)) >> File "/usr/lib64/python2.6/ctypes/util.py", line 203, in >> _findSoname_ldconfig >> os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read()) >> OSError: [Errno 12] Cannot allocate memory >> [root@ ~]# lsof | wc >> -bash: fork: Cannot allocate memory >> [root@ ~]# lsof | wc >> 21801 211209 3230028 >> [root@ ~]# ceph status >> ^CError connecting to cluster: InterruptedOrTimeoutError >> ^[[A[root@ ~]# lsof | wc >>2028 17476 190947 >> >> >> >> And meanwhile the daemons has then been crashed. >> >> I verified the memory never ran out. > Is there anything in dmesg? It sure looks like the OS thinks it's run > out of memory one way or another. > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com May it be related to memory fragmentation? http://dom.as/2014/01/17/on-swapping-and-kernels/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
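If fragmentation is the suspicion, /proc/buddyinfo is the cheapest thing to capture on the affected hosts. A small sketch that prints the free-block counts per order; plenty of free order-0 pages but almost no high-order blocks is the pattern the linked article describes:

def read_buddyinfo(path="/proc/buddyinfo"):
    # Each data column is the number of free blocks of order N (block size = 4 KiB * 2**N).
    zones = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            node, zone = parts[1].rstrip(","), parts[3]
            counts = [int(x) for x in parts[4:]]
            zones.append((node, zone, counts))
    return zones

for node, zone, counts in read_buddyinfo():
    print(f"node {node} zone {zone:8s} free blocks by order: {counts}")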
Re: [ceph-users] Expanding pg's of an erasure coded pool
On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman wrote: > > - Message from Gregory Farnum - >Date: Wed, 21 May 2014 15:46:17 -0700 > >From: Gregory Farnum > Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool > To: Kenneth Waegeman > Cc: ceph-users > > >> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman >> wrote: >>> >>> Thanks! I increased the max processes parameter for all daemons quite a >>> lot >>> (until ulimit -u 3802720) >>> >>> These are the limits for the daemons now.. >>> [root@ ~]# cat /proc/17006/limits >>> Limit Soft Limit Hard Limit Units >>> Max cpu time unlimitedunlimited >>> seconds >>> Max file size unlimitedunlimitedbytes >>> Max data size unlimitedunlimitedbytes >>> Max stack size10485760 unlimitedbytes >>> Max core file sizeunlimitedunlimitedbytes >>> Max resident set unlimitedunlimitedbytes >>> Max processes 3802720 3802720 >>> processes >>> Max open files3276832768files >>> Max locked memory 6553665536bytes >>> Max address space unlimitedunlimitedbytes >>> Max file locksunlimitedunlimitedlocks >>> Max pending signals 9506895068 >>> signals >>> Max msgqueue size 819200 819200 bytes >>> Max nice priority 00 >>> Max realtime priority 00 >>> Max realtime timeout unlimitedunlimitedus >>> >>> But this didn't help. Are there other parameters I should change? >> >> >> Hrm, is it exactly the same stack trace? You might need to bump the >> open files limit as well, although I'd be surprised. :/ > > > I increased the open file limit as test to 128000, still the same results. > > Stack trace: > But I see some things happening on the system while doing this too: > > > > [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096 > set pool 16 pgp_num to 4096 > [root@ ~]# ceph status > Traceback (most recent call last): > File "/usr/bin/ceph", line 830, in > sys.exit(main()) > File "/usr/bin/ceph", line 590, in main > conffile=conffile) > File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__ > librados_path = find_library('rados') > File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library > return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name)) > File "/usr/lib64/python2.6/ctypes/util.py", line 203, in > _findSoname_ldconfig > os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read()) > OSError: [Errno 12] Cannot allocate memory > [root@ ~]# lsof | wc > -bash: fork: Cannot allocate memory > [root@ ~]# lsof | wc > 21801 211209 3230028 > [root@ ~]# ceph status > ^CError connecting to cluster: InterruptedOrTimeoutError > ^[[A[root@ ~]# lsof | wc >2028 17476 190947 > > > > And meanwhile the daemons has then been crashed. > > I verified the memory never ran out. Is there anything in dmesg? It sure looks like the OS thinks it's run out of memory one way or another. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
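Alongside dmesg, one hedged suggestion (not something established in this thread) is to capture the kernel's commit accounting at the moment fork() fails, since ENOMEM from fork while plenty of cached memory is free can come from overcommit limits rather than real exhaustion:

def meminfo(fields=("MemFree", "Cached", "Committed_AS", "CommitLimit")):
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in fields:
                out[key] = value.strip()
    return out

with open("/proc/sys/vm/overcommit_memory") as f:
    print("vm.overcommit_memory =", f.read().strip())
for key, value in meminfo().items():
    print(f"{key:15s} {value}")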
Re: [ceph-users] Expanding pg's of an erasure coded pool
- Message from Gregory Farnum - Date: Wed, 21 May 2014 15:46:17 -0700 From: Gregory Farnum Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool To: Kenneth Waegeman Cc: ceph-users On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman wrote: Thanks! I increased the max processes parameter for all daemons quite a lot (until ulimit -u 3802720) These are the limits for the daemons now.. [root@ ~]# cat /proc/17006/limits Limit Soft Limit Hard Limit Units Max cpu time unlimitedunlimitedseconds Max file size unlimitedunlimitedbytes Max data size unlimitedunlimitedbytes Max stack size10485760 unlimitedbytes Max core file sizeunlimitedunlimitedbytes Max resident set unlimitedunlimitedbytes Max processes 3802720 3802720 processes Max open files3276832768files Max locked memory 6553665536bytes Max address space unlimitedunlimitedbytes Max file locksunlimitedunlimitedlocks Max pending signals 9506895068signals Max msgqueue size 819200 819200 bytes Max nice priority 00 Max realtime priority 00 Max realtime timeout unlimitedunlimitedus But this didn't help. Are there other parameters I should change? Hrm, is it exactly the same stack trace? You might need to bump the open files limit as well, although I'd be surprised. :/ I increased the open file limit as test to 128000, still the same results. Stack trace: -16> 2014-05-22 11:10:05.262456 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6293/6310/6293) [255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14 lpr=6310 pi=6293-6309/1 crt=0'0 active] exit Started/ReplicaActive/RepNotRecovering 52.314752 4 0.000408 -15> 2014-05-22 11:10:05.262649 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6293/6310/6293) [255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14 lpr=6310 pi=6293-6309/1 crt=0'0 active] exit Started/ReplicaActive 52.315020 0 0.00 -14> 2014-05-22 11:10:05.262667 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6293/6310/6293) [255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14 lpr=6310 pi=6293-6309/1 crt=0'0 active] exit Started 55.181842 0 0.00 -13> 2014-05-22 11:10:05.262681 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6293/6310/6293) [255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14 lpr=6310 pi=6293-6309/1 crt=0'0 active] enter Reset -12> 2014-05-22 11:10:05.262797 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6327/6327/6327) [200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1 lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] exit Reset 0.000117 1 0.000338 -11> 2014-05-22 11:10:05.262956 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6327/6327/6327) [200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1 lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] enter Started -10> 2014-05-22 11:10:05.262983 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6327/6327/6327) [200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1 lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] enter Start -9> 2014-05-22 11:10:05.262994 7f3bfcaee700 1 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6327/6327/6327) [200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1 lpr=6327 
pi=6293-6326/2 crt=0'0 inactive NOTIFY] state: transitioning to Stray -8> 2014-05-22 11:10:05.263151 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6327/6327/6327) [200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1 lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] exit Start 0.000169 0 0.00 -7> 2014-05-22 11:10:05.263385 7f3bfcaee700 5 osd.398 pg_epoch: 6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326 6327/6327/6327) [200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1 lpr
Re: [ceph-users] Expanding pg's of an erasure coded pool
On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman wrote: > Thanks! I increased the max processes parameter for all daemons quite a lot > (until ulimit -u 3802720) > > These are the limits for the daemons now.. > [root@ ~]# cat /proc/17006/limits > Limit Soft Limit Hard Limit Units > Max cpu time unlimitedunlimitedseconds > Max file size unlimitedunlimitedbytes > Max data size unlimitedunlimitedbytes > Max stack size10485760 unlimitedbytes > Max core file sizeunlimitedunlimitedbytes > Max resident set unlimitedunlimitedbytes > Max processes 3802720 3802720 > processes > Max open files3276832768files > Max locked memory 6553665536bytes > Max address space unlimitedunlimitedbytes > Max file locksunlimitedunlimitedlocks > Max pending signals 9506895068signals > Max msgqueue size 819200 819200 bytes > Max nice priority 00 > Max realtime priority 00 > Max realtime timeout unlimitedunlimitedus > > But this didn't help. Are there other parameters I should change? Hrm, is it exactly the same stack trace? You might need to bump the open files limit as well, although I'd be surprised. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
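To see how close the daemons actually get to the limits before the assert fires, a small sketch that sums the Threads: field of every ceph-osd process on a host (run it while driving the pg split):

import os

def ceph_osd_threads():
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() != "ceph-osd":
                    continue
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("Threads:"):
                        counts[int(pid)] = int(line.split()[1])
        except OSError:
            continue             # the process exited while we were reading it
    return counts

counts = ceph_osd_threads()
print(counts, "total:", sum(counts.values()))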
Re: [ceph-users] Expanding pg's of an erasure coded pool
Thanks! I increased the max processes parameter for all daemons quite a lot (until ulimit -u 3802720). These are the limits for the daemons now:

[root@ ~]# cat /proc/17006/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            10485760     unlimited    bytes
Max core file size        unlimited    unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             3802720      3802720      processes
Max open files            32768        32768        files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       95068        95068        signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

But this didn't help. Are there other parameters I should change? I also got a 'bash: fork: Cannot allocate memory' error once when running a command after starting the ceph services. It shouldn't be a memory shortage as such, because while monitoring the failure there is still enough (cached) memory available. - Message from Gregory Farnum - Date: Tue, 20 May 2014 10:33:30 -0700 From: Gregory Farnum Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool To: Kenneth Waegeman Cc: ceph-users This failure means the messenger subsystem is trying to create a thread and is getting an error code back — probably due to a process or system thread limit that you can turn up with ulimit. This is happening because a replicated PG primary needs a connection to only its replicas (generally 1 or 2 connections), but with an erasure-coded PG the primary requires a connection to m+n-1 replicas (everybody who's in the erasure-coding set, including itself). Right now our messenger requires a thread for each connection, so kerblam. (And it actually requires a couple such connections because we have separate heartbeat, cluster data, and client data systems.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman wrote: Hi, On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to create a erasure coded pool with 4096 pgs, but this crashed the cluster. I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I then try to expand to 4096 (not even quite enough) the cluster crashes again. ( Do we need less of pg's with erasure coding?)
The crash starts with individual OSDs crashing, eventually bringing down the mons (until there is no more quorum or too few osds) Out of the logs: -16> 2014-05-20 10:31:55.545590 7fd42f34d700 5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545590, event: started, request: pg_query(0.974 epoch 3315) v3 -15> 2014-05-20 10:31:55.545776 7fd42f34d700 1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9 026b40 -14> 2014-05-20 10:31:55.545807 7fd42f34d700 5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545807, event: done, request: pg_query(0.974 epoch 3315) v3 -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700 1 -- 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0 -12> 2014-05-20 10:31:55.564034 7fd3bf72f700 1 -- 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0 -11> 2014-05-20 10:31:55.627776 7fd42df4b700 1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 47+0+0 (855262282 0 0) 0xb6863c0 con 0x1255b9c0 -10> 2014-05-20 10:31:55.629425 7fd42df4b700 1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 47+0+0 (2581193378 0 0) 0x93d6c80 con 0x1255b9c0 -9> 2014-05-20 10:31:55.631270 7fd42f34d700 1 -- 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 pg_query(7.3ffs6 epoch 3326) v3 144+0+0 (221596234 0 0) 0x10b994a0 con 0x9383860 -8> 2014-05-20 10:31:55.631308 7fd42f34d700 5 -- op tracker -- , seq: 14302,
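On "are there other parameters I should change": besides the per-process limits shown above, a few system-wide knobs also gate thread creation. Whether any of them is the culprit here is not established in this thread, but they are cheap to check:

# Sketch: print the system-wide limits that can make pthread_create()/fork() fail
# even when the per-process ulimits look generous.
SYSCTLS = {
    "kernel/threads-max": "hard cap on threads for the whole system",
    "kernel/pid_max":     "largest PID/TID value; every thread consumes one",
    "vm/max_map_count":   "max memory mappings per process (each thread stack is one)",
}

for name, why in SYSCTLS.items():
    with open(f"/proc/sys/{name}") as f:
        print(f"{name:22s} = {f.read().strip():>10s}   # {why}")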
Re: [ceph-users] Expanding pg's of an erasure coded pool
This failure means the messenger subsystem is trying to create a thread and is getting an error code back — probably due to a process or system thread limit that you can turn up with ulimit. This is happening because a replicated PG primary needs a connection to only its replicas (generally 1 or 2 connections), but with an erasure-coded PG the primary requires a connection to m+n-1 replicas (everybody who's in the erasure-coding set, including itself). Right now our messenger requires a thread for each connection, so kerblam. (And it actually requires a couple such connections because we have separate heartbeat, cluster data, and client data systems.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman wrote: > Hi, > > On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to > create a erasure coded pool with 4096 pgs, but this crashed the cluster. > I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I > then try to expand to 4096 (not even quite enough) the cluster crashes > again. ( Do we need less of pg's with erasure coding?) > > The crash starts with individual OSDs crashing, eventually bringing down the > mons (until there is no more quorum or too few osds) > > Out of the logs: > > >-16> 2014-05-20 10:31:55.545590 7fd42f34d700 5 -- op tracker -- , seq: > 14301, time: 2014-05-20 10:31:55.545590, event: started, request: > pg_query(0.974 epoch 3315) v3 >-15> 2014-05-20 10:31:55.545776 7fd42f34d700 1 -- > 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 > epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9 > 026b40 >-14> 2014-05-20 10:31:55.545807 7fd42f34d700 5 -- op tracker -- , seq: > 14301, time: 2014-05-20 10:31:55.545807, event: done, request: > pg_query(0.974 epoch 3315) v3 >-13> 2014-05-20 10:31:55.559661 7fd3fdb0f700 1 -- > 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 > l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0 >-12> 2014-05-20 10:31:55.564034 7fd3bf72f700 1 -- > 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 > l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0 >-11> 2014-05-20 10:31:55.627776 7fd42df4b700 1 -- > 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 > osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 47+0+0 > (855262282 0 0) 0xb6863c0 con 0x1255b9c0 >-10> 2014-05-20 10:31:55.629425 7fd42df4b700 1 -- > 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 > osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 47+0+0 > (2581193378 0 0) 0x93d6c80 con 0x1255b9c0 > -9> 2014-05-20 10:31:55.631270 7fd42f34d700 1 -- > 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 > pg_query(7.3ffs6 epoch 3326) v3 144+0+0 (221596234 0 0) 0x10b994a0 con > 0x9383860 > -8> 2014-05-20 10:31:55.631308 7fd42f34d700 5 -- op tracker -- , seq: > 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request: > pg_query(7.3ffs6 epoch 3326) v3 > -7> 2014-05-20 10:31:55.631315 7fd42f34d700 5 -- op tracker -- , seq: > 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request: > pg_query(7.3ffs6 epoch 3326) v3 > -6> 2014-05-20 10:31:55.631339 7fd42f34d700 5 -- op tracker -- , seq: > 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request: > pg_query(7.3ffs6 epoch 3326) v3 > -5> 2014-05-20 10:31:55.631343 7fd42f34d700 5 -- op tracker -- , seq: > 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, 
request: > pg_query(7.3ffs6 epoch 3326) v3 > -4> 2014-05-20 10:31:55.631349 7fd42f34d700 5 -- op tracker -- , seq: > 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request: > pg_query(7.3ffs6 epoch 3326) v3 > -3> 2014-05-20 10:31:55.631363 7fd42f34d700 5 -- op tracker -- , seq: > 14302, time: 2014-05-20 10:31:55.631363, event: started, request: > pg_query(7.3ffs6 epoch 3326) v3 > -2> 2014-05-20 10:31:55.631402 7fd42f34d700 5 -- op tracker -- , seq: > 14302, time: 2014-05-20 10:31:55.631402, event: done, request: > pg_query(7.3ffs6 epoch 3326) v3 > -1> 2014-05-20 10:31:55.631488 7fd427b41700 1 -- > 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 -- > pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860 > 0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In > function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20 > 10:31:55.630937 > common/Thread.cc: 110: FAILED assert(ret == 0) > > ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) > 1: (Thread::create(unsigned long)+0x8a) [0xa83f8a] > 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa] > 3: (Accepter::entry()+0x265) [0xb3ca45] > 4: (()+0x79d1) [0x7fd4436b19d1] > 5: (clone()+0x6d) [0x7fd4423ecb6d] > > --- begin dump of recent events --- > 0> 2014-05-20 1
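To make the m+n-1 point concrete for this particular cluster, a rough worked example; the pool width of 15 is inferred from the 15-entry acting sets quoted elsewhere in this thread, and the per-connection costs are assumptions for illustration:

OSDS, PGS, WIDTH = 400, 4096, 15     # WIDTH = k+m, inferred from acting sets like [255,52,...,398]
CHANNELS, THREADS_PER_CONN = 3, 2    # assumptions: client/cluster/heartbeat channels, reader+writer per connection

shards_per_osd = PGS * WIDTH / OSDS             # ~154 PG shards hosted by each OSD
links_per_osd = shards_per_osd * (WIDTH - 1)    # ~2150 shard-to-peer relationships
peers_per_osd = min(links_per_osd, OSDS - 1)    # at most one connection per distinct peer OSD
threads = peers_per_osd * CHANNELS * THREADS_PER_CONN
print(f"~{shards_per_osd:.0f} shards/OSD, ~{peers_per_osd:.0f} peers/OSD, ~{threads:.0f} messenger threads per daemon")

With 20 daemons per host that is on the order of 40K-50K threads on one machine, exactly the regime where thread creation starts failing once any per-user or system-wide cap is hit. Again, a sketch rather than a measurement.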
[ceph-users] Expanding pg's of an erasure coded pool
Hi, On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to create a erasure coded pool with 4096 pgs, but this crashed the cluster. I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I then try to expand to 4096 (not even quite enough) the cluster crashes again. ( Do we need less of pg's with erasure coding?) The crash starts with individual OSDs crashing, eventually bringing down the mons (until there is no more quorum or too few osds) Out of the logs: -16> 2014-05-20 10:31:55.545590 7fd42f34d700 5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545590, event: started, request: pg_query(0.974 epoch 3315) v3 -15> 2014-05-20 10:31:55.545776 7fd42f34d700 1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9 026b40 -14> 2014-05-20 10:31:55.545807 7fd42f34d700 5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545807, event: done, request: pg_query(0.974 epoch 3315) v3 -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700 1 -- 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0 -12> 2014-05-20 10:31:55.564034 7fd3bf72f700 1 -- 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0 -11> 2014-05-20 10:31:55.627776 7fd42df4b700 1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 47+0+0 (855262282 0 0) 0xb6863c0 con 0x1255b9c0 -10> 2014-05-20 10:31:55.629425 7fd42df4b700 1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 47+0+0 (2581193378 0 0) 0x93d6c80 con 0x1255b9c0 -9> 2014-05-20 10:31:55.631270 7fd42f34d700 1 -- 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 pg_query(7.3ffs6 epoch 3326) v3 144+0+0 (221596234 0 0) 0x10b994a0 con 0x9383860 -8> 2014-05-20 10:31:55.631308 7fd42f34d700 5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request: pg_query(7.3ffs6 epoch 3326) v3 -7> 2014-05-20 10:31:55.631315 7fd42f34d700 5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request: pg_query(7.3ffs6 epoch 3326) v3 -6> 2014-05-20 10:31:55.631339 7fd42f34d700 5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request: pg_query(7.3ffs6 epoch 3326) v3 -5> 2014-05-20 10:31:55.631343 7fd42f34d700 5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request: pg_query(7.3ffs6 epoch 3326) v3 -4> 2014-05-20 10:31:55.631349 7fd42f34d700 5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request: pg_query(7.3ffs6 epoch 3326) v3 -3> 2014-05-20 10:31:55.631363 7fd42f34d700 5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631363, event: started, request: pg_query(7.3ffs6 epoch 3326) v3 -2> 2014-05-20 10:31:55.631402 7fd42f34d700 5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631402, event: done, request: pg_query(7.3ffs6 epoch 3326) v3 -1> 2014-05-20 10:31:55.631488 7fd427b41700 1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 -- pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860 0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20 10:31:55.630937 
common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) 1: (Thread::create(unsigned long)+0x8a) [0xa83f8a] 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa] 3: (Accepter::entry()+0x265) [0xb3ca45] 4: (()+0x79d1) [0x7fd4436b19d1] 5: (clone()+0x6d) [0x7fd4423ecb6d] --- begin dump of recent events --- 0> 2014-05-20 10:31:56.622247 7fd3bc5fe700 -1 *** Caught signal (Aborted) ** in thread 7fd3bc5fe700 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) 1: /usr/bin/ceph-osd() [0x9ab3b1] 2: (()+0xf710) [0x7fd4436b9710] 3: (gsignal()+0x35) [0x7fd442336925] 4: (abort()+0x175) [0x7fd442338105] 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fd442bf0a5d] 6: (()+0xbcbe6) [0x7fd442beebe6] 7: (()+0xbcc13) [0x7fd442beec13] 8: (()+0xbcd0e) [0x7fd442beed0e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0xaec612] 10: (Thread::create(unsigned long)+0x8a) [0xa83f8a] 11: (Pipe::connect()+0x2efb) [0xb2850b] 12: (Pipe::writer()+0x9f3) [0xb2a063] 13: (Pipe::Writer::entry()+0xd) [0xb359cd] 14: (()+0x79d1) [0x7fd4436b19d1] 15: (clone()+0x6d) [0x7fd4423ecb6d] NOTE: a copy of the executable, or `obj
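On the question in the original post ("Do we need less of pg's with erasure coding?"): the commonly cited sizing heuristic counts every shard, so for an erasure-coded pool the width k+m takes the place of the replica count. A hedged sketch, with the width of 15 inferred from the acting sets in this thread and the ~100-PGs-per-OSD target being the usual rule of thumb rather than a guarantee:

def suggested_pg_num(num_osds, pool_size, target_pgs_per_osd=100):
    """Rule-of-thumb pg_num: (OSDs * target) / pool_size, rounded up to a power of two.
    pool_size is the replica count for replicated pools and k+m for EC pools."""
    raw = num_osds * target_pgs_per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power

print(suggested_pg_num(400, 3))    # replicated, size 3        -> 16384
print(suggested_pg_num(400, 15))   # erasure-coded, k+m = 15   -> 4096

So by that rule of thumb the answer is yes: a wide EC pool needs fewer PGs than a replicated pool on the same OSDs, because every PG already fans out across k+m OSDs, and the thread-per-connection cost discussed above grows with that fan-out as well.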