Re: Ceph version 0.56.1, data loss on power failure
On 16/01/2013 17:56, Jeff Mitchell wrote:
> FWIW, my ceph data dirs (e.g. for mons) are all on XFS. I've experienced a lot of corruption on these on power loss to the node -- and in some cases even when power wasn't lost and the box was simply rebooted. This is on Ubuntu 12.04 with the ceph-provided 3.6.3 kernel (as I'm using RBD on these). It's pretty much to the point where I'm thinking of changing them all over to ext4 for these data dirs, as the hassle of rebuilding mons constantly is just not worth the trouble.

In October, I lost a complete Ceph cluster because of a combination of a memory management bug in kernel 3.6 plus a bug in XFS (another BUG). I had 12 nodes and replication was at 2; 5 or 6 machines crashed in a row because of the mm bug, and 2 ended up with unrecoverable corruption. So 150 TB of data on the cluster were unrecoverable. Fortunately it was only test data.

If you want the gory details, see here: http://oss.sgi.com/archives/xfs/2012-10/msg00420.html

This XFS bug was corrected in 3.0.52, 3.2.34, 3.4.19 and 3.6.7. Dave Chinner was very quick to fix the problem. Add the latest bug (journal not flushed properly, not yet fixed on the latest kernels) and I can understand your reaction...

But, believe it or not, I'm still confident in XFS. I've been using it for more than 10 years on TB and TB of data, and apart from those recent problems, XFS has been extremely good (stability, performance, crash tolerance) all this time. Not saying ext4 isn't good, but if you follow kernel development you'll see that it's not bug-free either... And that's not even speaking of btrfs, which was totally unstable with Ceph on my last tries (6 months ago). In fact, Ceph hammers hardware hard, so it's very good at finding bugs in the Linux kernel :)

So, for the moment, I'm sticking with the 3.4.25 kernel. Long-term kernel, proven, stable: no mm problems, no XFS problems.

Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Single host VM limit when using RBD
I've run into a limit on the maximum number of RBD-backed VMs that I'm able to run on a single host. I have 20 VMs (21 RBD volumes open) running on a single host, and when booting the 21st machine I get the below error from libvirt/QEMU. I'm able to shut down a VM and start another in its place, so there seems to be a hard limit on the number of volumes I'm able to have open. I did some googling, and error 11 from pthread_create seems to mean 'resource unavailable', so I'm probably running into a thread limit of some sort. I did try increasing the threads-max kernel option but nothing changed. I moved a few VMs to a different, empty host and they start with no issues at all.

This machine has 4 OSDs running on it in addition to the 20 VMs. Kernel 3.7.1, Ceph 0.56.1 and QEMU 1.3.0. There is currently 65GB of 96GB RAM free and no swap.

Can anyone suggest where the limit might be, or anything I can do to narrow down the problem?

Thanks
-Matt

-

Error starting domain: internal error Process exited while reading console log output: char device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4eb5a65960 time 2013-01-17 02:32:58.096437
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (()+0x2aaa8f) [0x7f4eb2de8a8f]
 2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575]
 3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc]
 4: (()+0xa0290) [0x7f4eb5b27290]
 5: (()+0x879dd) [0x7f4eb5b0e9dd]
 6: (()+0x87c1b) [0x7f4eb5b0ec1b]
 7: (()+0x87ae1) [0x7f4eb5b0eae1]
 8: (()+0x87d50) [0x7f4eb5b0ed50]
 9: (()+0xb37b2) [0x7f4eb5b3a7b2]
 10: (()+0x1e83eb) [0x7f4eb5c6f3eb]
 11: (()+0x1ab54a) [0x7f4eb5c3254a]
 12: (main()+0x9da) [0x7f4eb5c72a3a]
 13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd]
 14: (()+0x710b9) [0x7f4eb5af80b9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 96, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 117, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1090, in startup
    self._backend.create()
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 620, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirtError: internal error Process exited while reading console log output: char device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11
[assert and backtrace identical to the one above, trimmed]
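For anyone hitting the same wall: pthread_create returns EAGAIN when any of several ceilings is hit, not just threads-max. A sketch of the usual suspects to check -- standard Linux paths, though the qemu binary name below is an assumption:

    ulimit -u                             # RLIMIT_NPROC: max processes/threads per user
    cat /proc/sys/kernel/threads-max      # system-wide thread cap
    cat /proc/sys/kernel/pid_max          # every thread consumes a PID
    cat /proc/sys/vm/max_map_count        # every thread stack is a memory mapping
    # How many threads is a given VM process actually using?
    grep Threads /proc/$(pidof -s qemu-system-x86_64)/status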
Re: Single host VM limit when using RBD
Hi Matthew,

Seems to be a low value in /proc/sys/kernel/threads-max.

On Thu, Jan 17, 2013 at 12:37 PM, Matthew Anderson matth...@base3.com.au wrote:
> I've run into a limit on the maximum number of RBD-backed VMs that I'm able to run on a single host. I have 20 VMs (21 RBD volumes open) running on a single host, and when booting the 21st machine I get the below error from libvirt/QEMU. [...]
> [rest of the quoted message and backtrace trimmed; see the original above]
RE: Single host VM limit when using RBD
Hi Andrey,

I did try your suggestion beforehand and it doesn't appear to fix the issue.

[root@KVM04 ~]# cat /proc/sys/kernel/threads-max
2549635
[root@KVM04 ~]# echo 5549635 > /proc/sys/kernel/threads-max
[root@KVM04 ~]# virsh start EX03
error: Failed to start domain EX03
error: internal error Process exited while reading console log output: char device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f5ec9706960 time 2013-01-17 16:46:50.935681
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (()+0x2aaa8f) [0x7f5ec6a89a8f]
 2: (SafeTimer::init()+0x95) [0x7f5ec6973575]
 3: (librados::RadosClient::connect()+0x72c) [0x7f5ec69099dc]
 4: (()+0xa0290) [0x7f5ec97c8290]
 5: (()+0x879dd) [0x7f5ec97af9dd]
 6: (()+0x87c1b) [0x7f5ec97afc1b]
 7: (()+0x87ae1) [0x7f5ec97afae1]
 8: (()+0x87d50) [0x7f5ec97afd50]
 9: (()+0xb37b2) [0x7f5ec97db7b2]
 10: (()+0x1e83eb) [0x7f5ec99103eb]
 11: (()+0x1ab54a) [0x7f5ec98d354a]
 12: (main()+0x9da) [0x7f5ec9913a3a]
 13: (__libc_start_main()+0xfd) [0x7f5ec5755cdd]
 14: (()+0x710b9) [0x7f5ec97990b9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after

-----Original Message-----
From: Andrey Korolyov [mailto:and...@xdel.ru]
Sent: Thursday, 17 January 2013 4:42 PM
To: Matthew Anderson
Cc: ceph-devel@vger.kernel.org
Subject: Re: Single host VM limit when using RBD

> Hi Matthew,
> Seems to be a low value in /proc/sys/kernel/threads-max.
> [rest of the quoted exchange trimmed; see above]
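Since raising threads-max alone didn't change anything here, the per-process view may be more telling. A sketch -- again, the qemu process name is an assumption:

    # The limits the running qemu process is actually under
    grep 'Max processes' /proc/$(pidof -s qemu-system-x86_64)/limits
    # Thread stacks also count against the per-process mapping ceiling
    cat /proc/sys/vm/max_map_count
    # Raising it is a quick test (value illustrative)
    echo 262144 > /proc/sys/vm/max_map_count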
Re: flashcache
On 17 January 2013 20:46, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
> 2013/1/16 Mark Nelson mark.nel...@inktank.com:
> I don't know if I have to use a single two-port IB card (switch redundancy and no card redundancy) or two single-port cards (or a single one-port IB card?).

On the topic of IB, but slightly off-topic all the same: I would love to attempt getting Ceph running on rsockets if I could find the time (alas, we don't run Ceph). rsockets is a fully userland implementation of BSD sockets over RDMA, supporting fork and all the usual goodies. In theory, unless you are using the kernel RBD module (or the kernel FS module, etc.) you should be able to run it on rsockets and enjoy a considerable performance increase. rsockets is available in the librdmacm git up on Open Fabrics, and dev + support happens on the linux-rdma list.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846
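For anyone who wants to try: rsockets ships with an LD_PRELOAD shim in librdmacm, so in principle a userspace daemon can be pointed at it without code changes. A sketch only -- the library path below is a guess (it varies by distro/build), and whether ceph's socket usage stays within what the shim intercepts is exactly the open question:

    # Hypothetical path to the rsockets preload library from librdmacm
    LD_PRELOAD=/usr/lib/rsocket/librspreload.so ceph-osd -i 0 -c /etc/ceph/ceph.conf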
Re: flashcache
On 01/16/2013 11:47 PM, Stefan Priebe - Profihost AG wrote:
> Hi Mark,
> On 16.01.2013 at 22:53, Mark wrote:
>> With only 2 SSDs for 12 spinning disks, you'll need to make sure the SSDs are really fast. I use Intel 520s for testing which are great, but I wouldn't use them in production.
> Why not? I use them for an SSD-only ceph cluster.
> Stefan

It's pretty tough to get an apples-to-apples comparison of endurance when looking at the Intel 520 vs. something like the DC S3700. If I were actually building out a production deployment I'd probably stick with the DC S3700 (especially if sticking journals, flashcache, and XFS journals for 6 OSDs on 1 drive!). There's probably a reasonable endurance-per-cost argument for a severely under-subscribed 520 (or other similar drive) as well. It'd be an interesting study to look at how long it takes a small enterprise drive to die vs. a larger under-subscribed consumer drive.

--
Mark Nelson
Performance Engineer
Inktank
Re: flashcache
On 01/17/2013 07:32 AM, Joseph Glanville wrote:
> On 17 January 2013 20:46, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
>> 2013/1/16 Mark Nelson mark.nel...@inktank.com:
>> I don't know if I have to use a single two-port IB card (switch redundancy and no card redundancy) or two single-port cards (or a single one-port IB card?).
> On the topic of IB, but slightly off-topic all the same: I would love to attempt getting Ceph running on rsockets if I could find the time (alas, we don't run Ceph). rsockets is a fully userland implementation of BSD sockets over RDMA, supporting fork and all the usual goodies. In theory, unless you are using the kernel RBD module (or the kernel FS module, etc.) you should be able to run it on rsockets and enjoy a considerable performance increase. rsockets is available in the librdmacm git up on Open Fabrics, and dev + support happens on the linux-rdma list.

There's been some talk about rsockets on the list before. I think there are a couple of different folks that have tried (succeeded?) in getting it working. Barring that, it sounds like if you tune interrupt affinity settings and various other bits you can get IPoIB up into the 2GB/s+ range, which while not RDMA speed is at least better than 10GbE.

--
Mark Nelson
Performance Engineer
Inktank
Hit suicide timeout after adding new osd
Hi guys,

I had a functioning Ceph system that reported HEALTH_OK. It was running with 3 osds on 3 servers. Then I added an extra osd on 1 of the servers using the commands from the documentation here: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

Shortly after I did that, 2 of the existing osds crashed. I restarted them and after some hours they were up and running again, but soon one of them crashed again - and a third existing osd crashed as well. I restarted those two and waited some hours for them to come up. A short while later one of them crashed again. I have then restarted that last one and watched the logs closely.

It seems the same pattern repeats itself every time. The osd starts up doing its normal maintenance before going up (takes a long while). Then it seems to be running, but logs the following every 5 seconds:

heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed out after 30

After some time it logs:

===
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide timed out after 300
2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0x8462bb]
 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8]
 4: (CephContextServiceThread::entry()+0x55) [0x8e01c5]
 5: /lib64/libpthread.so.0() [0x360de07d14]
 6: (clone()+0x6d) [0x360d6f167d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) ** in thread 7f053f149700
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: /usr/bin/ceph-osd() [0x82ea90]
 2: /lib64/libpthread.so.0() [0x360de0efe0]
 3: (gsignal()+0x35) [0x360d635925]
 4: (abort()+0x148) [0x360d6370d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
===

How can I avoid this? Is it a bug, or have I done something wrong?

I'm running Ceph 0.56.1 from the official RPMs on Fedora 17. The underlying disks and network connectivity have been tested and nothing seems to be wrong there.

Thanks in advance for your assistance!

--
Jens Kristian Søgaard, Mermaid Consulting ApS, j...@mermaidconsulting.dk, http://www.mermaidconsulting.com/
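For what it's worth, the 30 and 300 in those messages correspond to OSD thread-heartbeat config options, which can be raised as a stopgap while a node catches up. A sketch -- the option names exist in the OSD config, but the values here are arbitrary, and raising the grace period only papers over whatever is actually blocking the op thread:

    [osd]
        ; grace before the "had timed out" warnings (the 30 above)
        osd op thread timeout = 60
        ; grace before the assert and abort (the 300 above)
        osd op thread suicide timeout = 600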
Re: flashcache
On Jan 17, 2013, at 8:37 AM, Mark Nelson mark.nel...@inktank.com wrote:
> On 01/17/2013 07:32 AM, Joseph Glanville wrote:
>> [earlier rsockets discussion trimmed; see above]
> There's been some talk about rsockets on the list before. I think there are a couple of different folks that have tried (succeeded?) in getting it working. Barring that, it sounds like if you tune interrupt affinity settings and various other bits you can get IPoIB up into the 2GB/s+ range, which while not RDMA speed is at least better than 10GbE.

IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high as (or higher than, if QDR) 10GbE.

I would be interested in seeing if rsockets helps ceph. I have questions though. By default, rsockets still has to copy data in and out. It has extensions for zero-copy, but they do not work with non-blocking sockets. Does ceph use non-blocking sockets? How many simultaneous connections does it support before falling over into the non-rsockets path? If rsockets has a ceiling on connections and falls over to the non-socket path, can the application determine if it is safe to use the zero-copy extensions for a specific connection, or do they fail gracefully?

Scott
Re: Hit suicide timeout after adding new osd
Hi,

On 01/17/2013 03:35 PM, Jens Kristian Søgaard wrote:
> I had a functioning Ceph system that reported HEALTH_OK. It was running with 3 osds on 3 servers. Then I added an extra osd on 1 of the servers [...]
> [rest of the quoted message and backtrace trimmed; see the original above]
> How can I avoid this? Is it a bug, or have I done something wrong?

I think you are seeing the same issue as I noticed about two weeks ago: http://www.spinics.net/lists/ceph-devel/msg11328.html

See this issue: http://tracker.newdream.net/issues/3714

I can't find branch wip-3714 anymore, so it might already be merged into next. You might want to try building from 'next' yourself or fetch some new packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/

Wido

> I'm running Ceph 0.56.1 from the official RPMs on Fedora 17. The underlying disks and network connectivity have been tested and nothing seems to be wrong there.
> Thanks in advance for your assistance!
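If it helps, building from 'next' is the standard autotools flow of this era. A sketch, assuming the build dependencies are already installed:

    git clone --recursive https://github.com/ceph/ceph.git
    cd ceph
    git checkout next
    ./autogen.sh
    ./configure
    make -j$(nproc)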
Re: flashcache
On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott atchle...@ornl.gov wrote:
> On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
>> 2013/1/17 Atchley, Scott atchle...@ornl.gov:
>>> IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high as (or higher than, if QDR) 10GbE.
>> Which kind of tuning? Do you have a paper about this?
> No, I followed the Mellanox tuning guide and modified their interrupt affinity scripts.

Did you try to bind interrupts only to the core to which the QPI link actually belongs, and measure the difference against a spread-over-all-cores binding?

>> But, actually, is it possible to use ceph with IPoIB in a stable way, or is this experimental?
> IPoIB appears as a traditional Ethernet device to Linux and can be used as such.

Not exactly; this summer the kernel added an additional driver for fully-featured L2 (an IB Ethernet driver). Before that it was quite painful to do any kind of failover using IPoIB.

>> I don't know if support for rsockets is experimental/untested and IPoIB is a stable workaround, or what else.
> IPoIB is much more used and pretty stable, while rsockets is new with limited testing. That said, more people using it will help Sean improve it. Ideally, we would like support for zero-copy and reduced CPU usage (via OS-bypass), and with more interconnects than just InfiniBand. :-)
>> And is a dual controller needed on each OSD node? Is Ceph able to handle OSD network failures? This is really important to know; it changes the whole network topology.
> I will let others answer this.
> Scott
Re: Hit suicide timeout after adding new osd
Hi,

On 01/17/2013 03:50 PM, Stefan Priebe wrote:
> Hi,
> On 17.01.2013 15:47, Wido den Hollander wrote:
>> You might want to try building from 'next' yourself or fetch some new packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/
> Should it be backported to the bobtail branch as well? Don't think it's in there.

I found the commit in the next branch: https://github.com/ceph/ceph/commit/7e94f6f1a7b7a865433edacd6a521f6ea1170eac

It doesn't seem to be in 'bobtail'.

Wido
Re: Hit suicide timeout after adding new osd
Hi Sage,

On 17.01.2013 16:33, Wido den Hollander wrote:
> [earlier quoting trimmed; see above]
> I found the commit in the next branch: https://github.com/ceph/ceph/commit/7e94f6f1a7b7a865433edacd6a521f6ea1170eac
> It doesn't seem to be in 'bobtail'.
> Wido

Was this forgotten? I mean, a lot of people have posted about the hit suicide timeout with 0.56 or 0.56.1, so it should also be in the bobtail branch?

Greets,
Stefan
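Whether the fix landed in a given branch is quick to check against a clone of ceph.git, for example:

    git fetch origin
    # list remote branches that already contain the suicide-timeout fix
    git branch -r --contains 7e94f6f1a7b7a865433edacd6a521f6ea1170eac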
Re: flashcache
On Jan 17, 2013, at 10:07 AM, Andrey Korolyov and...@xdel.ru wrote:
> [earlier quoting trimmed; see above]
> Did you try to bind interrupts only to the core to which the QPI link actually belongs, and measure the difference against a spread-over-all-cores binding?

This is the modified part. I bound the mlx4-async handler to core 0 and the mlx4-ib-1-0 handler to core 1 for our machines.

>> But, actually, is it possible to use ceph with IPoIB in a stable way, or is this experimental?
>> IPoIB appears as a traditional Ethernet device to Linux and can be used as such.
> Not exactly; this summer the kernel added an additional driver for fully-featured L2 (an IB Ethernet driver). Before that it was quite painful to do any kind of failover using IPoIB.

I assume it is now an EoIB driver. Does it replace the IPoIB driver?

> [rest of the quoted rsockets discussion trimmed; see above]

Scott
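The generic mechanism behind those Mellanox affinity scripts is /proc/irq, in case anyone wants to reproduce the binding by hand. A sketch -- IRQ numbers and handler names differ per host, so the bracketed placeholders are illustrative only:

    # Find the IRQ numbers of the mlx4 handlers
    grep mlx4 /proc/interrupts
    # Pin the async handler to core 0 (bitmask 0x1) and the ib handler to core 1 (bitmask 0x2)
    echo 1 > /proc/irq/<async_irq>/smp_affinity
    echo 2 > /proc/irq/<ib_irq>/smp_affinity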
Re: flashcache
On Jan 17, 2013, at 10:14 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
> 2013/1/17 Atchley, Scott atchle...@ornl.gov:
>> IPoIB appears as a traditional Ethernet device to Linux and can be used as such. Ceph has no idea that it is not Ethernet.
> Ok. Now it's clear. AFAIK, a standard SDR IB card should give us more speed than GbE (less overhead?) and lower latency, I think.

Yes. It should get close to 1 GB/s, where 1GbE is limited to about 125 MB/s. Lower latency? Probably, since most Ethernet drivers set interrupt coalescing by default. The Intel e1000 driver, for example, has a cluster mode that reduces (or turns off) interrupt coalescing. I don't know if ceph is latency sensitive or not.

Scott
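Coalescing can also be toggled at runtime with ethtool, which makes it easy to test how latency sensitive a workload actually is. A sketch -- whether rx-usecs 0 and adaptive-rx are honored depends on the driver:

    # Show the current coalescing settings
    ethtool -c eth0
    # Favor latency over throughput: adaptive off, interrupt per packet
    ethtool -C eth0 adaptive-rx off rx-usecs 0 rx-frames 1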
Current OSD weight vs. target weight
Hi,

If run during a user-issued reweight (i.e. "ceph osd crush reweight x y"), "ceph osd tree" shows the target weight of an OSD. Is there a way to see the *current* weight of the OSD?

We would like to be able to approximate the amount of rollback necessary per OSD if we need to cancel a larger reweight at some point. This seems impossible, though, without knowing the current weight. Any hints how to do this?

Regards,
--ck

--
filoo GmbH
Dr. Christopher Kunz
E-Mail: ch...@filoo.de
Tel.: (+49) 0 52 41 8 67 30 -18
Fax: (+49) 0 52 41 / 8 67 30 -20
Please sign & encrypt mail wherever possible, my key: C882 8ED1 7DD1 9011 C088 EA50 5CFA 2EEB 397A CAC1
Moltkestraße 25a 0 Gütersloh
HRB4355, AG Gütersloh
Geschäftsführer: S.Grewing, J.Rehpöhler, Dr. C.Kunz
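One way to at least inspect the weights the cluster is enforcing right now is to pull the in-force maps. A sketch using standard commands -- with the caveat (an assumption worth verifying for your version) that CRUSH stores a single weight per item, so once the reweight command is applied the map already shows the new value, and migration progress has to be inferred from pg states instead:

    # Dump the osdmap (includes per-osd reweight values)
    ceph osd dump -o /tmp/osdmap.txt
    # Decompile the crush map in force to read the item weights
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
    grep 'item osd' /tmp/crushmap.txt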
Re: flashcache
On Jan 17, 2013, at 11:01 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
> 2013/1/17 Atchley, Scott atchle...@ornl.gov:
>> Yes. It should get close to 1 GB/s, where 1GbE is limited to about 125 MB/s. Lower latency? Probably, since most Ethernet drivers set interrupt coalescing by default. The Intel e1000 driver, for example, has a cluster mode that reduces (or turns off) interrupt coalescing. I don't know if ceph is latency sensitive or not.

Sorry, I meant 10GbE. 10GbE should get close to 1.2 GB/s, compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver.

Scott
Re: flashcache
Hi,

On 17.01.2013 17:12, Atchley, Scott wrote:
> Sorry, I meant 10GbE. 10GbE should get close to 1.2 GB/s, compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver.

We're using bonded active/active 2x10GbE with the Intel ixgbe driver and I'm able to get 2.3GB/s. Not sure how to measure latency effectively.

Stefan
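On measuring latency: netperf's request/response test is the usual companion to the stream test. A sketch, assuming netserver is running on the peer host:

    # Throughput (what the 2.3GB/s figure measures)
    netperf -H <peer> -t TCP_STREAM
    # Latency: TCP_RR reports transactions/sec for a 1-byte request/response;
    # 1e6 / rate gives round-trip time in microseconds
    netperf -H <peer> -t TCP_RR -- -r 1,1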
Re: flashcache
Hi,

On 17.01.2013 17:21, Gandalf Corvotempesta wrote:
> 2013/1/17 Stefan Priebe s.pri...@profihost.ag:
>> We're using bonded active/active 2x10GbE with the Intel ixgbe driver and I'm able to get 2.3GB/s.
> Which kind of switch do you use?

HP 5920

Stefan
RE: Ceph slow request unstable issue
On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
> Some update summary for the tested cases till now. Ceph is v0.56.1.
>
> 1. RBD: Ubuntu 13.04 + 3.7 kernel; OSD: Ubuntu 13.04 + 3.7 kernel, XFS.
>    Result: kernel panic on both RBD and OSD sides.

We're very interested in the RBD client-side kernel panic! I don't think there are known issues with 3.7.

> 2. RBD: Ubuntu 13.04 + 3.2 kernel; OSD: Ubuntu 13.04 + 3.2 kernel, XFS.
>    Result: kernel panic on RBD (~15 minutes).

This less so; we've only backported fixes as far as 3.4.

> 3. RBD: Ubuntu 13.04 + 3.6.7 kernel (suggested by ceph.com); OSD: Ubuntu 13.04 + 3.2 kernel, XFS.
>    Result: auto-reset on OSD (~30 mins after the test started).
> 4. RBD: Ubuntu 13.04 + 3.6.7 kernel (suggested by ceph.com); OSD: Ubuntu 12.04 + 3.2.0-36 kernel (suggested by ceph.com), XFS.
>    Result: auto-reset on OSD (~30 mins after the test started).

These are the weird exit_mm traces shown below?

> 5. RBD: Ubuntu 13.04 + 3.6.7 kernel (suggested by ceph.com); OSD: Ubuntu 13.04 + 3.6.7 (suggested by Sage), XFS.
>    Result: seems stable for the last hour, still running now.

Eager to hear how this goes. Thanks!

sage

> Tests 3 and 4 are repeatable.
>
> My test setup:
> OSD side: 3 nodes, 60 disks (20 per node, 1 per OSD), 10GbE, 4 Intel 520 SSDs per node as journal, XFS. Each node: 2 * Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz + 128GB RAM.
> RBD side: 8 nodes; each node: 10GbE, 2 * Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 128GB RAM.
>
> Method: create 240 RBDs and mount them on 8 nodes (30 RBDs per node), then run dd concurrently on all 240 RBDs. After ~30 minutes, one of the OSD nodes is likely to reset. Ceph OSD logs, syslog and dmesg from the reset node are available if you need them. (It looks to me that there is no valuable information except a lot of slow requests in the OSD's log.)
>
> Xiaoxi
>
> -----Original Message-----
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: 2013-01-17 10:35
> To: Chen, Xiaoxi
> Subject: RE: Ceph slow request unstable issue
>
> On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
>> No, on the OSD node, not the same node. OSD nodes with the 3.2 kernel, client nodes with the 3.6 kernel. We did suffer kernel panics on RBD client nodes, but after upgrading the client kernel to 3.6.6 it seems solved.
> Is it easy to try the 3.6 kernel on the osd nodes too?
>
> -----Original Message-----
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: 2013-01-17 10:17
> To: Chen, Xiaoxi
> Subject: RE: Ceph slow request unstable issue
>
> On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
>> It is easy to reproduce in my setup... Once I have enough high load on it and wait for tens of minutes, I can see such logs. As a forecast, slow requests of more than 30~60s are frequently present in the ceph osd's log.
> Just replied to your other email. Do I understand correctly that you are seeing this problem on the *rbd client* nodes? Or also on the OSDs? Are they the same nodes?
>
> sage
>
> -----Original Message-----
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: 2013-01-17 0:59
> To: Andrey Korolyov
> Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
> Subject: Re: Ceph slow request unstable issue
>
> Hi,
>
> On Wed, 16 Jan 2013, Andrey Korolyov wrote:
>> On Wed, Jan 16, 2013 at 4:58 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
>>> Hi list,
>>> We are suffering from OSDs or the OS going down when there is continuing high pressure on the Ceph rack. Basically we are on Ubuntu 12.04 + Ceph 0.56.1, 6 nodes, each node with 20 spindles + 4 SSDs as journal (120 spindles in total). We create a lot of RBD volumes (say 240), mount them on 16 different client machines (15 RBD volumes per client) and run dd concurrently on top of each RBD. The issues are:
>>> 1. Slow requests. From the list archive this seems solved in 0.56.1, but we still notice such warnings.
>>> 2. OSD down or even host down, like the message below. It seems some OSD has been blocked there for quite a long time.
>>> Suggestions are highly appreciated. Thanks, Xiaoxi
>>> _
>>> Bad news: I have moved all my Ceph machines' OS back to kernel 3.2.0-23, which Ubuntu 12.04 uses. I ran the dd command (dd if=/dev/zero bs=1M count=6 of=/dev/rbd${i}) on the Ceph clients to create the data-prepare test last night. Now I have one machine down (can't be reached by ping),
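On pinning down what the slow requests are actually stuck on: besides grepping the logs, the OSD admin socket can report in-flight ops. A sketch -- the default socket path is assumed, and the dump command should exist in 0.56-era builds but is worth verifying:

    # Watch slow request warnings cluster-wide
    ceph -w | grep 'slow request'
    # Ask one OSD what operations it is currently blocked on
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight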
Re: HOWTO: teuthology and code coverage
It's great to see people outside of Inktank starting to get into using teuthology. Thanks for the write-up!
-Greg

On Wed, Jan 16, 2013 at 6:01 AM, Loic Dachary l...@dachary.org wrote:
> Hi,
> I'm happy to report that running teuthology to get an lcov code coverage report worked for me. http://dachary.org/wp-uploads/2013/01/teuthology/total/mon/Monitor.cc.gcov.html
> It took me a while to figure out the logic (thanks Josh for the help :-). I wrote a HOWTO explaining the steps in detail. It should be straightforward to run on an OpenStack tenant, using virtual machines instead of bare metal. http://dachary.org/?p=1788
> Cheers
master branch issue in ceph.git
The latest code is hanging trying to start teuthology. I used teuthology-nuke to clear old state and reboot the machines. I was using my branch rebased to latest master, and when that started failing I switched to the default config. It still keeps hanging here:

INFO:teuthology.task.ceph:Waiting until ceph is healthy...

$ ceph -s
   health HEALTH_WARN 5 pgs degraded; 108 pgs stuck unclean
   monmap e1: 3 mons at {0=10.214.131.23:6789/0,1=10.214.131.21:6789/0,2=10.214.131.20:6789/0}, election epoch 6, quorum 0,1,2 0,1,2
   osdmap e7: 9 osds: 9 up, 9 in
   pgmap v25: 108 pgs: 103 active+remapped, 5 active+degraded; 0 bytes data, 798 GB used, 3050 GB / 4055 GB avail
   mdsmap e2: 0/0/0 up

David Zafman
Senior Developer
david.zaf...@inktank.com
Re: master branch issue in ceph.git
On Thu, 17 Jan 2013, David Zafman wrote:
> The latest code is hanging trying to start teuthology. I used teuthology-nuke to clear old state and reboot the machines. I was using my branch rebased to latest master, and when that started failing I switched to the default config. It still keeps hanging here:
> [ceph -s output trimmed; see above]

git pull the latest teuthology.git master. By default the crush map separates replicas across hosts now, and teuthology needed to be updated to do it by osd instead. Sorry!

sage
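The host-vs-osd replica separation Sage mentions also maps to a config option for clusters that want the old behavior. A sketch of the ceph.conf form -- the option exists in bobtail-era ceph, but double-check the default for your version:

    [global]
        ; 0 = separate replicas across osds, 1 = across hosts (the new default)
        osd crush chooseleaf type = 0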