Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-06-02 Thread Kenneth Waegeman


- Message from Guang Yang  -
   Date: Fri, 30 May 2014 08:56:37 +0800
   From: Guang Yang 
Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool
 To: Gregory Farnum 
 Cc: Kenneth Waegeman , ceph-users  





On May 28, 2014, at 5:31 AM, Gregory Farnum  wrote:


On Sun, May 25, 2014 at 6:24 PM, Guang Yang  wrote:

On May 21, 2014, at 1:33 AM, Gregory Farnum  wrote:


This failure means the messenger subsystem is trying to create a
thread and is getting an error code back — probably due to a process
or system thread limit that you can turn up with ulimit.

This is happening because a replicated PG primary needs a connection
to only its replicas (generally 1 or 2 connections), but with an
erasure-coded PG the primary requires a connection to m+n-1 replicas
(everybody who's in the erasure-coding set, including itself). Right
now our messenger requires a thread for each connection, so kerblam.
(And it actually requires a couple such connections because we have
separate heartbeat, cluster data, and client data systems.)

Hi Greg,
Is there any plan to refactor the messenger component to reduce  
the num of threads? For example, use event-driven mode.


We've discussed it in very broad terms, but there are no concrete
designs and it's not on the schedule yet. If anybody has conclusive
evidence that it's causing them trouble they can't work around, that
would be good to know...


Well, we weren't able to find the source of the problem we had (see the
original message), but there was no more time to test it further. So
something (memory or thread related?) was causing serious trouble on the
large setup that didn't happen on our small setup, but at the moment I
have no more information than what is included in the mails of this thread.



Thanks for the response!

We used to have a cluster with 11 disks (daemons) per OSD host; each host
ran around 15K threads. The system was stable, but during cluster-wide
changes (e.g. OSD down/out, recovery) we observed the system load
increasing, though there was no cascading failure.


Most recently we have been evaluating Ceph on high-density hardware with
33 disks (daemons) per OSD host; each host runs around 40K-50K threads.
With some OSD hosts down/out, we started seeing the load increase sharply,
along with a large volume of thread creation and joining.


We don't have strong evidence that the messenger thread model is the
problem, or that an event-driven approach would help, but I think that as
we move to higher-density hardware (for cost savings) the issue could be
amplified.


If there is any plan, it would be good to know; we are very interested in
getting involved.


Thanks,
Guang


-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com



- End message from Guang Yang  -

--

Kind regards,
Kenneth Waegeman




Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-29 Thread Guang Yang
On May 28, 2014, at 5:31 AM, Gregory Farnum  wrote:

> On Sun, May 25, 2014 at 6:24 PM, Guang Yang  wrote:
>> On May 21, 2014, at 1:33 AM, Gregory Farnum  wrote:
>> 
>>> This failure means the messenger subsystem is trying to create a
>>> thread and is getting an error code back — probably due to a process
>>> or system thread limit that you can turn up with ulimit.
>>> 
>>> This is happening because a replicated PG primary needs a connection
>>> to only its replicas (generally 1 or 2 connections), but with an
>>> erasure-coded PG the primary requires a connection to m+n-1 replicas
>>> (everybody who's in the erasure-coding set, including itself). Right
>>> now our messenger requires a thread for each connection, so kerblam.
>>> (And it actually requires a couple such connections because we have
>>> separate heartbeat, cluster data, and client data systems.)
>> Hi Greg,
>> Is there any plan to refactor the messenger component to reduce the num of 
>> threads? For example, use event-driven mode.
> 
> We've discussed it in very broad terms, but there are no concrete
> designs and it's not on the schedule yet. If anybody has conclusive
> evidence that it's causing them trouble they can't work around, that
> would be good to know…
Thanks for the response!

We used to have a cluster with 11 disks (daemons) per OSD host; each host ran 
around 15K threads. The system was stable, but during cluster-wide changes 
(e.g. OSD down/out, recovery) we observed the system load increasing, though 
there was no cascading failure.

Most recently we have been evaluating Ceph on high-density hardware with 33 
disks (daemons) per OSD host; each host runs around 40K-50K threads. With some 
OSD hosts down/out, we started seeing the load increase sharply, along with a 
large volume of thread creation and joining.
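
For what it's worth, a quick way to verify those per-host and per-daemon
thread counts on a node is sketched below (standard Linux tools; the
ceph-osd process name is the only assumption):

# one line per thread (LWP) across all ceph-osd daemons on this host
ps -eLf | grep '[c]eph-osd' | wc -l

# thread count per individual OSD daemon
for pid in $(pidof ceph-osd); do
    printf 'pid %s: ' "$pid"
    grep '^Threads:' /proc/$pid/status
done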

We don't have strong evidence that the messenger thread model is the problem, 
or that an event-driven approach would help, but I think that as we move to 
higher-density hardware (for cost savings) the issue could be amplified.

If there is any plan, it would be good to know; we are very interested in getting involved.

Thanks,
Guang

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com



Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-27 Thread Gregory Farnum
On Sun, May 25, 2014 at 6:24 PM, Guang Yang  wrote:
> On May 21, 2014, at 1:33 AM, Gregory Farnum  wrote:
>
>> This failure means the messenger subsystem is trying to create a
>> thread and is getting an error code back — probably due to a process
>> or system thread limit that you can turn up with ulimit.
>>
>> This is happening because a replicated PG primary needs a connection
>> to only its replicas (generally 1 or 2 connections), but with an
>> erasure-coded PG the primary requires a connection to m+n-1 replicas
>> (everybody who's in the erasure-coding set, including itself). Right
>> now our messenger requires a thread for each connection, so kerblam.
>> (And it actually requires a couple such connections because we have
>> separate heartbeat, cluster data, and client data systems.)
> Hi Greg,
> Is there any plan to refactor the messenger component to reduce the num of 
> threads? For example, use event-driven mode.

We've discussed it in very broad terms, but there are no concrete
designs and it's not on the schedule yet. If anybody has conclusive
evidence that it's causing them trouble they can't work around, that
would be good to know...
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-25 Thread Guang Yang
On May 21, 2014, at 1:33 AM, Gregory Farnum  wrote:

> This failure means the messenger subsystem is trying to create a
> thread and is getting an error code back — probably due to a process
> or system thread limit that you can turn up with ulimit.
> 
> This is happening because a replicated PG primary needs a connection
> to only its replicas (generally 1 or 2 connections), but with an
> erasure-coded PG the primary requires a connection to m+n-1 replicas
> (everybody who's in the erasure-coding set, including itself). Right
> now our messenger requires a thread for each connection, so kerblam.
> (And it actually requires a couple such connections because we have
> separate heartbeat, cluster data, and client data systems.)
Hi Greg,
Is there any plan to refactor the messenger component to reduce the num of 
threads? For example, use event-driven mode.

Thanks,
Guang
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
>  wrote:
>> Hi,
>> 
>> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to
>> create a erasure coded pool with 4096 pgs, but this crashed the cluster.
>> I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I
>> then try to expand to 4096 (not even quite enough) the cluster crashes
>> again. ( Do we need less of pg's with erasure coding?)
>> 
>> The crash starts with individual OSDs crashing, eventually bringing down the
>> mons (until there is no more quorum or too few osds)
>> 
>> Out of the logs:
>> 
>> 
>>   -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq:
>> 14301, time: 2014-05-20 10:31:55.545590, event: started, request:
>> pg_query(0.974 epoch 3315) v3
>>   -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 --
>> 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974
>> epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
>>   -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq:
>> 14301, time: 2014-05-20 10:31:55.545807, event: done, request:
>> pg_query(0.974 epoch 3315) v3
>>   -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 --
>> 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0
>> l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
>>   -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 --
>> 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0
>> l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
>>   -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 --
>> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 
>> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2  47+0+0
>> (855262282 0 0) 0xb6863c0 con 0x1255b9c0
>>   -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 --
>> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 
>> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2  47+0+0
>> (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
>>-9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 --
>> 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 
>> pg_query(7.3ffs6 epoch 3326) v3  144+0+0 (221596234 0 0) 0x10b994a0 con
>> 0x9383860
>>-8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631363, event: started, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631402, event: done, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-1> 2014-05-20 10:31:55.631488 7fd427b41700  1 --
>> 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 --
>> pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
>> 0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In
>> function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20
>> 10:31:55.630937
>> common/Thread.cc: 110: FAILED assert(ret == 0)
>> 
>> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>

Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-22 Thread Henrik Korkuc
On 2014.05.22 19:55, Gregory Farnum wrote:
> On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman
>  wrote:
>> - Message from Gregory Farnum  -
>>Date: Wed, 21 May 2014 15:46:17 -0700
>>
>>From: Gregory Farnum 
>> Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool
>>  To: Kenneth Waegeman 
>>  Cc: ceph-users 
>>
>>
>>> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
>>>  wrote:
>>>> Thanks! I increased the max processes parameter for all daemons quite a
>>>> lot
>>>> (until ulimit -u 3802720)
>>>>
>>>> These are the limits for the daemons now..
>>>> [root@ ~]# cat /proc/17006/limits
>>>> Limit Soft Limit   Hard Limit   Units
>>>> Max cpu time  unlimitedunlimited
>>>> seconds
>>>> Max file size unlimitedunlimitedbytes
>>>> Max data size unlimitedunlimitedbytes
>>>> Max stack size10485760 unlimitedbytes
>>>> Max core file sizeunlimitedunlimitedbytes
>>>> Max resident set  unlimitedunlimitedbytes
>>>> Max processes 3802720  3802720
>>>> processes
>>>> Max open files3276832768files
>>>> Max locked memory 6553665536bytes
>>>> Max address space unlimitedunlimitedbytes
>>>> Max file locksunlimitedunlimitedlocks
>>>> Max pending signals   9506895068
>>>> signals
>>>> Max msgqueue size 819200   819200   bytes
>>>> Max nice priority 00
>>>> Max realtime priority 00
>>>> Max realtime timeout  unlimitedunlimitedus
>>>>
>>>> But this didn't help. Are there other parameters I should change?
>>>
>>> Hrm, is it exactly the same stack trace? You might need to bump the
>>> open files limit as well, although I'd be surprised. :/
>>
>> I increased the open file limit to 128000 as a test, still the same results.
>>
>> Stack trace:
> 
>
>> But I see some things happening on the system while doing this too:
>>
>>
>>
>> [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096
>> set pool 16 pgp_num to 4096
>> [root@ ~]# ceph status
>> Traceback (most recent call last):
>>   File "/usr/bin/ceph", line 830, in 
>> sys.exit(main())
>>   File "/usr/bin/ceph", line 590, in main
>> conffile=conffile)
>>   File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__
>> librados_path = find_library('rados')
>>   File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library
>> return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
>>   File "/usr/lib64/python2.6/ctypes/util.py", line 203, in
>> _findSoname_ldconfig
>> os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read())
>> OSError: [Errno 12] Cannot allocate memory
>> [root@ ~]# lsof | wc
>> -bash: fork: Cannot allocate memory
>> [root@ ~]# lsof | wc
>>   21801  211209 3230028
>> [root@ ~]# ceph status
>> ^CError connecting to cluster: InterruptedOrTimeoutError
>> ^[[A[root@ ~]# lsof | wc
>>2028   17476  190947
>>
>>
>>
>> And meanwhile the daemons had crashed.
>>
>> I verified the memory never ran out.
> Is there anything in dmesg? It sure looks like the OS thinks it's run
> out of memory one way or another.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

Could it be related to memory fragmentation?
http://dom.as/2014/01/17/on-swapping-and-kernels/
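
A few quick checks on an affected host can help distinguish outright
exhaustion from fragmentation or overcommit pressure; this is only a
sketch, and which (if any) of these applies here is a guess:

# did the kernel refuse allocations or kill anything?
dmesg | egrep -i 'oom|out of memory|page allocation failure'

# free pages per order and zone: plenty of order-0 pages but almost no
# higher-order ones points at fragmentation rather than exhaustion
cat /proc/buddyinfo

# overall memory and commit picture
grep -E 'MemFree|Cached|CommitLimit|Committed_AS' /proc/meminfo
cat /proc/sys/vm/overcommit_memory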



Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-22 Thread Gregory Farnum
On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman
 wrote:
>
> - Message from Gregory Farnum  -
>Date: Wed, 21 May 2014 15:46:17 -0700
>
>From: Gregory Farnum 
> Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool
>  To: Kenneth Waegeman 
>  Cc: ceph-users 
>
>
>> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
>>  wrote:
>>>
>>> Thanks! I increased the max processes parameter for all daemons quite a
>>> lot
>>> (until ulimit -u 3802720)
>>>
>>> These are the limits for the daemons now..
>>> [root@ ~]# cat /proc/17006/limits
>>> Limit Soft Limit   Hard Limit   Units
>>> Max cpu time  unlimitedunlimited
>>> seconds
>>> Max file size unlimitedunlimitedbytes
>>> Max data size unlimitedunlimitedbytes
>>> Max stack size10485760 unlimitedbytes
>>> Max core file sizeunlimitedunlimitedbytes
>>> Max resident set  unlimitedunlimitedbytes
>>> Max processes 3802720  3802720
>>> processes
>>> Max open files3276832768files
>>> Max locked memory 6553665536bytes
>>> Max address space unlimitedunlimitedbytes
>>> Max file locksunlimitedunlimitedlocks
>>> Max pending signals   9506895068
>>> signals
>>> Max msgqueue size 819200   819200   bytes
>>> Max nice priority 00
>>> Max realtime priority 00
>>> Max realtime timeout  unlimitedunlimitedus
>>>
>>> But this didn't help. Are there other parameters I should change?
>>
>>
>> Hrm, is it exactly the same stack trace? You might need to bump the
>> open files limit as well, although I'd be surprised. :/
>
>
> I increased the open file limit to 128000 as a test, still the same results.
>
> Stack trace:



> But I see some things happening on the system while doing this too:
>
>
>
> [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096
> set pool 16 pgp_num to 4096
> [root@ ~]# ceph status
> Traceback (most recent call last):
>   File "/usr/bin/ceph", line 830, in 
> sys.exit(main())
>   File "/usr/bin/ceph", line 590, in main
> conffile=conffile)
>   File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__
> librados_path = find_library('rados')
>   File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library
> return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
>   File "/usr/lib64/python2.6/ctypes/util.py", line 203, in
> _findSoname_ldconfig
> os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read())
> OSError: [Errno 12] Cannot allocate memory
> [root@ ~]# lsof | wc
> -bash: fork: Cannot allocate memory
> [root@ ~]# lsof | wc
>   21801  211209 3230028
> [root@ ~]# ceph status
> ^CError connecting to cluster: InterruptedOrTimeoutError
> ^[[A[root@ ~]# lsof | wc
>2028   17476  190947
>
>
>
> And meanwhile the daemons had crashed.
>
> I verified the memory never ran out.

Is there anything in dmesg? It sure looks like the OS thinks it's run
out of memory one way or another.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-22 Thread Kenneth Waegeman


- Message from Gregory Farnum  -
   Date: Wed, 21 May 2014 15:46:17 -0700
   From: Gregory Farnum 
Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool
 To: Kenneth Waegeman 
 Cc: ceph-users 



On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
 wrote:

Thanks! I increased the max processes parameter for all daemons quite a lot
(until ulimit -u 3802720)

These are the limits for the daemons now..
[root@ ~]# cat /proc/17006/limits
Limit Soft Limit   Hard Limit   Units
Max cpu time  unlimitedunlimitedseconds
Max file size unlimitedunlimitedbytes
Max data size unlimitedunlimitedbytes
Max stack size10485760 unlimitedbytes
Max core file sizeunlimitedunlimitedbytes
Max resident set  unlimitedunlimitedbytes
Max processes 3802720  3802720
processes
Max open files3276832768files
Max locked memory 6553665536bytes
Max address space unlimitedunlimitedbytes
Max file locksunlimitedunlimitedlocks
Max pending signals   9506895068signals
Max msgqueue size 819200   819200   bytes
Max nice priority 00
Max realtime priority 00
Max realtime timeout  unlimitedunlimitedus

But this didn't help. Are there other parameters I should change?


Hrm, is it exactly the same stack trace? You might need to bump the
open files limit as well, although I'd be surprised. :/


I increased the open file limit to 128000 as a test, still the same results.

Stack trace:

   -16> 2014-05-22 11:10:05.262456 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6293/6310/6293)  
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14  
lpr=6310 pi=6293-6309/1 crt=0'0 active] exit  
Started/ReplicaActive/RepNotRecovering 52.314752 4 0.000408
   -15> 2014-05-22 11:10:05.262649 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6293/6310/6293)  
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14  
lpr=6310 pi=6293-6309/1 crt=0'0 active] exit Started/ReplicaActive  
52.315020 0 0.00
   -14> 2014-05-22 11:10:05.262667 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6293/6310/6293)  
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14  
lpr=6310 pi=6293-6309/1 crt=0'0 active] exit Started 55.181842 0  
0.00
   -13> 2014-05-22 11:10:05.262681 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6293/6310/6293)  
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14  
lpr=6310 pi=6293-6309/1 crt=0'0 active] enter Reset
   -12> 2014-05-22 11:10:05.262797 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6327/6327/6327)  
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1  
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] exit Reset 0.000117 1  
0.000338
   -11> 2014-05-22 11:10:05.262956 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6327/6327/6327)  
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1  
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] enter Started
   -10> 2014-05-22 11:10:05.262983 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6327/6327/6327)  
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1  
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] enter Start
-9> 2014-05-22 11:10:05.262994 7f3bfcaee700  1 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6327/6327/6327)  
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1  
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] state:  
transitioning to Stray
-8> 2014-05-22 11:10:05.263151 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6327/6327/6327)  
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1  
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] exit Start 0.000169 0  
0.00
-7> 2014-05-22 11:10:05.263385 7f3bfcaee700  5 osd.398 pg_epoch:  
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326  
6327/6327/6327)  
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1  
lpr

Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-21 Thread Gregory Farnum
On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
 wrote:
> Thanks! I increased the max processes parameter for all daemons quite a lot
> (until ulimit -u 3802720)
>
> These are the limits for the daemons now..
> [root@ ~]# cat /proc/17006/limits
> Limit Soft Limit   Hard Limit   Units
> Max cpu time  unlimitedunlimitedseconds
> Max file size unlimitedunlimitedbytes
> Max data size unlimitedunlimitedbytes
> Max stack size10485760 unlimitedbytes
> Max core file sizeunlimitedunlimitedbytes
> Max resident set  unlimitedunlimitedbytes
> Max processes 3802720  3802720
> processes
> Max open files3276832768files
> Max locked memory 6553665536bytes
> Max address space unlimitedunlimitedbytes
> Max file locksunlimitedunlimitedlocks
> Max pending signals   9506895068signals
> Max msgqueue size 819200   819200   bytes
> Max nice priority 00
> Max realtime priority 00
> Max realtime timeout  unlimitedunlimitedus
>
> But this didn't help. Are there other parameters I should change?

Hrm, is it exactly the same stack trace? You might need to bump the
open files limit as well, although I'd be surprised. :/
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
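
In case it helps, a sketch for checking how close the daemons actually get
to the open-files limit, and for raising it at daemon start; the ceph.conf
option name is quoted from memory, so treat it as an assumption to verify:

# file descriptors in use per OSD daemon, versus its limit
for pid in $(pidof ceph-osd); do
    echo "pid $pid: $(ls /proc/$pid/fd | wc -l) fds in use"
    grep 'Max open files' /proc/$pid/limits
done

# raise the limit the daemons start with, e.g. in ceph.conf:
# [global]
#     max open files = 131072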


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-21 Thread Kenneth Waegeman
Thanks! I increased the max processes parameter for all daemons quite a
lot (up to ulimit -u 3802720).


These are the limits for the daemons now..
[root@ ~]# cat /proc/17006/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3802720              3802720              processes
Max open files            32768                32768                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       95068                95068                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

But this didn't help. Are there other parameters I should change?

I also got a 'bash: fork: Cannot allocate memory' error once when running
a command after starting the ceph services. It shouldn't be a memory
shortage issue as such, because when monitoring during the failure there
is still enough (cached) memory available.
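
The per-process ulimit is not the only ceiling that can make thread
creation fail; a few kernel-level knobs can hit first even when
/proc/<pid>/limits looks generous. Below is a sketch of what to inspect
and, with illustrative values only, raise; whether any of these is the
actual culprit here is just a guess:

# kernel ceilings that also bound thread creation
# (pid/thread counts are system-wide, the map count is per process)
sysctl kernel.threads-max kernel.pid_max vm.max_map_count

# current total number of threads on the host, for comparison
ps -eLf | wc -l

# raise them; persist in /etc/sysctl.conf if they turn out to help
sysctl -w kernel.pid_max=4194303
sysctl -w kernel.threads-max=2000000
sysctl -w vm.max_map_count=1048576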



- Message from Gregory Farnum  -
   Date: Tue, 20 May 2014 10:33:30 -0700
   From: Gregory Farnum 
Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool
 To: Kenneth Waegeman 
 Cc: ceph-users 



This failure means the messenger subsystem is trying to create a
thread and is getting an error code back — probably due to a process
or system thread limit that you can turn up with ulimit.

This is happening because a replicated PG primary needs a connection
to only its replicas (generally 1 or 2 connections), but with an
erasure-coded PG the primary requires a connection to m+n-1 replicas
(everybody who's in the erasure-coding set, including itself). Right
now our messenger requires a thread for each connection, so kerblam.
(And it actually requires a couple such connections because we have
separate heartbeat, cluster data, and client data systems.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
 wrote:

Hi,

On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to
create a erasure coded pool with 4096 pgs, but this crashed the cluster.
I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I
then try to expand to 4096 (not even quite enough) the cluster crashes
again. ( Do we need less of pg's with erasure coding?)

The crash starts with individual OSDs crashing, eventually bringing down the
mons (until there is no more quorum or too few osds)

Out of the logs:


   -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq:
14301, time: 2014-05-20 10:31:55.545590, event: started, request:
pg_query(0.974 epoch 3315) v3
   -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 --
130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974
epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
   -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq:
14301, time: 2014-05-20 10:31:55.545807, event: done, request:
pg_query(0.974 epoch 3315) v3
   -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 --
130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0
l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
   -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 --
130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0
l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
   -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 --
130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 
osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2  47+0+0
(855262282 0 0) 0xb6863c0 con 0x1255b9c0
   -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 --
130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 
osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2  47+0+0
(2581193378 0 0) 0x93d6c80 con 0x1255b9c0
-9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 --
130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 
pg_query(7.3ffs6 epoch 3326) v3  144+0+0 (221596234 0 0) 0x10b994a0 con
0x9383860
-8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq:
14302,

Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-20 Thread Gregory Farnum
This failure means the messenger subsystem is trying to create a
thread and is getting an error code back — probably due to a process
or system thread limit that you can turn up with ulimit.

This is happening because a replicated PG primary needs a connection
to only its replicas (generally 1 or 2 connections), but with an
erasure-coded PG the primary requires a connection to m+n-1 replicas
(everybody who's in the erasure-coding set, including itself). Right
now our messenger requires a thread for each connection, so kerblam.
(And it actually requires a couple such connections because we have
separate heartbeat, cluster data, and client data systems.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
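
To make that concrete, here is a back-of-envelope sketch of how the
per-daemon thread count can blow up, plus a check of the limits the daemon
is actually running with. The EC profile, PG count and threads-per-connection
figure are illustrative assumptions, not measured values, and real
connections are shared between PGs, so this is an upper bound:

# rough upper bound on messenger threads for one OSD
K=10; M=4                 # example erasure-coding profile (the m+n above)
PGS_PER_OSD=200           # example number of PGs hosted by this OSD
THREADS_PER_CONN=2        # assumed threads per connection (reader + writer)
MESSENGERS=3              # heartbeat, cluster data and client data, as above
echo $(( PGS_PER_OSD * (K + M - 1) * THREADS_PER_CONN * MESSENGERS )) threads

# what the running daemon is actually allowed
PID=$(pidof ceph-osd | awk '{print $1}')
grep -E 'Max processes|Max open files' /proc/$PID/limits

# raise the soft limits in the shell that (re)starts the daemons
ulimit -u 327680
ulimit -n 131072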


On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
 wrote:
> Hi,
>
> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to
> create a erasure coded pool with 4096 pgs, but this crashed the cluster.
> I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I
> then try to expand to 4096 (not even quite enough) the cluster crashes
> again. ( Do we need less of pg's with erasure coding?)
>
> The crash starts with individual OSDs crashing, eventually bringing down the
> mons (until there is no more quorum or too few osds)
>
> Out of the logs:
>
>
>-16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq:
> 14301, time: 2014-05-20 10:31:55.545590, event: started, request:
> pg_query(0.974 epoch 3315) v3
>-15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 --
> 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974
> epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
>-14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq:
> 14301, time: 2014-05-20 10:31:55.545807, event: done, request:
> pg_query(0.974 epoch 3315) v3
>-13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 --
> 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0
> l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
>-12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 --
> 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0
> l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
>-11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 --
> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 
> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2  47+0+0
> (855262282 0 0) 0xb6863c0 con 0x1255b9c0
>-10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 --
> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 
> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2  47+0+0
> (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
> -9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 --
> 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 
> pg_query(7.3ffs6 epoch 3326) v3  144+0+0 (221596234 0 0) 0x10b994a0 con
> 0x9383860
> -8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request:
> pg_query(7.3ffs6 epoch 3326) v3
> -7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request:
> pg_query(7.3ffs6 epoch 3326) v3
> -6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request:
> pg_query(7.3ffs6 epoch 3326) v3
> -5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request:
> pg_query(7.3ffs6 epoch 3326) v3
> -4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request:
> pg_query(7.3ffs6 epoch 3326) v3
> -3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631363, event: started, request:
> pg_query(7.3ffs6 epoch 3326) v3
> -2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631402, event: done, request:
> pg_query(7.3ffs6 epoch 3326) v3
> -1> 2014-05-20 10:31:55.631488 7fd427b41700  1 --
> 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 --
> pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
>  0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20
> 10:31:55.630937
> common/Thread.cc: 110: FAILED assert(ret == 0)
>
>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>  1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
>  3: (Accepter::entry()+0x265) [0xb3ca45]
>  4: (()+0x79d1) [0x7fd4436b19d1]
>  5: (clone()+0x6d) [0x7fd4423ecb6d]
>
> --- begin dump of recent events ---
>  0> 2014-05-20 1

[ceph-users] Expanding pg's of an erasure coded pool

2014-05-20 Thread Kenneth Waegeman

Hi,

On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to
create an erasure-coded pool with 4096 PGs, but this crashed the cluster.
I then started with 1024 PGs and expanded to 2048 (pg_num and pgp_num);
when I then try to expand to 4096 (which is still not quite enough), the
cluster crashes again. (Do we need fewer PGs with erasure coding?)
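
One workaround sketch is to grow the pool in smaller increments and let
the cluster settle between steps instead of jumping straight to the
target; the pool name and step size below are placeholders, and whether
this avoids the crash depends on which limit is actually being hit:

POOL=ecdata            # placeholder pool name
TARGET=4096
CUR=$(ceph osd pool get $POOL pg_num | awk '{print $2}')
while [ "$CUR" -lt "$TARGET" ]; do
    CUR=$(( CUR + 256 > TARGET ? TARGET : CUR + 256 ))
    ceph osd pool set $POOL pg_num  $CUR
    ceph osd pool set $POOL pgp_num $CUR    # pgp_num must never exceed pg_num
    # wait for peering/backfill to settle before the next step
    while ceph health | grep -vq HEALTH_OK; do sleep 30; done
done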


The crash starts with individual OSDs crashing, eventually bringing  
down the mons (until there is no more quorum or too few osds)


Out of the logs:


   -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- ,  
seq: 14301, time: 2014-05-20 10:31:55.545590, event: started, request:  
pg_query(0.974 epoch 3315) v3
   -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 --  
130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 --  
pg_notify(0.974 epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
   -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- ,  
seq: 14301, time: 2014-05-20 10:31:55.545807, event: done, request:  
pg_query(0.974 epoch 3315) v3
   -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 --  
130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0  
pgs=0 cs=0 l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
   -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 --  
130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0  
pgs=0 cs=0 l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
   -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 --  
130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3   
osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2   
47+0+0 (855262282 0 0) 0xb6863c0 con 0x1255b9c0
   -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 --  
130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4   
osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2   
47+0+0 (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
-9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 --  
130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2  
 pg_query(7.3ffs6 epoch 3326) v3  144+0+0 (221596234 0 0)  
0x10b994a0 con 0x9383860
-8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- ,  
seq: 14302, time: 2014-05-20 10:31:55.631130, event: header_read,  
request: pg_query(7.3ffs6 epoch 3326) v3
-7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- ,  
seq: 14302, time: 2014-05-20 10:31:55.631133, event: throttled,  
request: pg_query(7.3ffs6 epoch 3326) v3
-6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- ,  
seq: 14302, time: 2014-05-20 10:31:55.631207, event: all_read,  
request: pg_query(7.3ffs6 epoch 3326) v3
-5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- ,  
seq: 14302, time: 2014-05-20 10:31:55.631303, event: dispatched,  
request: pg_query(7.3ffs6 epoch 3326) v3
-4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- ,  
seq: 14302, time: 2014-05-20 10:31:55.631349, event:  
waiting_for_osdmap, request: pg_query(7.3ffs6 epoch 3326) v3
-3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- ,  
seq: 14302, time: 2014-05-20 10:31:55.631363, event: started, request:  
pg_query(7.3ffs6 epoch 3326) v3
-2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- ,  
seq: 14302, time: 2014-05-20 10:31:55.631402, event: done, request:  
pg_query(7.3ffs6 epoch 3326) v3
-1> 2014-05-20 10:31:55.631488 7fd427b41700  1 --  
130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 --  
pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
 0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc:  
In function 'void Thread::create(size_t)' thread 7fd42cb49700 time  
2014-05-20 10:31:55.630937

common/Thread.cc: 110: FAILED assert(ret == 0)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
 3: (Accepter::entry()+0x265) [0xb3ca45]
 4: (()+0x79d1) [0x7fd4436b19d1]
 5: (clone()+0x6d) [0x7fd4423ecb6d]

--- begin dump of recent events ---
 0> 2014-05-20 10:31:56.622247 7fd3bc5fe700 -1 *** Caught signal  
(Aborted) **

 in thread 7fd3bc5fe700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: /usr/bin/ceph-osd() [0x9ab3b1]
 2: (()+0xf710) [0x7fd4436b9710]
 3: (gsignal()+0x35) [0x7fd442336925]
 4: (abort()+0x175) [0x7fd442338105]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fd442bf0a5d]
 6: (()+0xbcbe6) [0x7fd442beebe6]
 7: (()+0xbcc13) [0x7fd442beec13]
 8: (()+0xbcd0e) [0x7fd442beed0e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x7f2) [0xaec612]

 10: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
 11: (Pipe::connect()+0x2efb) [0xb2850b]
 12: (Pipe::writer()+0x9f3) [0xb2a063]
 13: (Pipe::Writer::entry()+0xd) [0xb359cd]
 14: (()+0x79d1) [0x7fd4436b19d1]
 15: (clone()+0x6d) [0x7fd4423ecb6d]
 NOTE: a copy of the executable, or `obj