Re: Ceph version 0.56.1, data loss on power failure

2013-01-17 Thread Yann Dupont

On 16/01/2013 17:56, Jeff Mitchell wrote:

FWIW, my ceph data dirs (for e.g. mons) are all on XFS. I've
experienced a lot of corruption on these on power loss to the node --
and in some cases even when power wasn't lost, and the box was simply
rebooted. This is on Ubuntu 12.04 with the ceph-provided 3.6.3 kernel
(as I'm using RBD on these).

It's pretty much to the point where I'm thinking of changing them all
over to ext4 for these data dirs, as the hassle of rebuilding mons
constantly is just not worth the trouble.

In October I lost a complete Ceph cluster because of a combination of
a memory-management bug in kernel 3.6 and a bug in XFS (another BUG). I 
had 12 nodes, replication was set to 2; 5 or 6 machines crashed in a row 
because of the mm bug, and 2 ended up with unrecoverable corruption.


So 150 TB of data on the cluster were unrecoverable. Fortunately it was 
only test data.


If you want the gory details, see here:

http://oss.sgi.com/archives/xfs/2012-10/msg00420.html

This XFS bug was fixed in 3.0.52, 3.2.34, 3.4.19 and 3.6.7. Dave Chinner 
was very quick to fix the problem.


Add the latest bug (journal not flushed properly), not yet fixed in the 
latest kernels, and I can understand your reaction...


But, believe it or not, I'm still confident in XFS. I've been using it 
for more than 10 years on TB and TB of data, and apart from those recent 
problems, XFS has been extremely good (stability, performance, crash 
tolerance) all this time.


Not saying ext4 isn't good, but if you follow kernel development, 
you'll see that it's not bug-free either...


Not to mention btrfs, which was totally unstable with Ceph in my 
last tries (6 months ago).


In fact, Ceph hammers hardware hard, so it's very good at finding 
bugs in the Linux kernel :)



So, for the moment, I'm sticking with the 3.4.25 kernel. A longterm kernel, 
proven, stable: no mm problems, no XFS problems.



Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Single host VM limit when using RBD

2013-01-17 Thread Matthew Anderson
I've run into a limit on the maximum number of RBD-backed VMs that I'm able to 
run on a single host. I have 20 VMs (21 RBD volumes open) running on a single 
host, and when booting the 21st machine I get the error below from libvirt/QEMU. 
I'm able to shut down a VM and start another in its place, so there seems to be 
a hard limit on the number of volumes I'm able to have open. I did some 
googling, and error 11 from pthread_create seems to mean 'resource 
unavailable', so I'm probably running into a thread limit of some sort. I did 
try increasing the max-threads kernel option but nothing changed. I moved a few 
VMs to a different, empty host and they start with no issues at all.

This machine has 4 OSDs running on it in addition to the 20 VMs. Kernel 
3.7.1, Ceph 0.56.1 and QEMU 1.3.0. There is currently 65 GB of 96 GB RAM free and 
no swap.
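
For reference, here are the knobs I know of that can make pthread_create() fail 
with EAGAIN; which one (if any) actually applies here is exactly what I'm unsure 
about, so treat this as a checklist rather than a diagnosis:

# per-user cap on processes/threads (RLIMIT_NPROC) for the current user
ulimit -u
# system-wide caps
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max
# each thread stack is a separate mapping, so this can also be the limit
cat /proc/sys/vm/max_map_count
# how many threads the qemu processes are already using
ps -eLf | grep -c '[q]emu'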

Can anyone suggest where the limit might be or anything I can do to narrow down 
the problem?

Thanks
-Matt
-

Error starting domain: internal error Process exited while reading console log 
output: char device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In 
function 'void Thread::create(size_t)' thread 7f4eb5a65960 time 2013-01-17 
02:32:58.096437
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (()+0x2aaa8f) [0x7f4eb2de8a8f]
2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575]
3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc]
4: (()+0xa0290) [0x7f4eb5b27290]
5: (()+0x879dd) [0x7f4eb5b0e9dd]
6: (()+0x87c1b) [0x7f4eb5b0ec1b]
7: (()+0x87ae1) [0x7f4eb5b0eae1]
8: (()+0x87d50) [0x7f4eb5b0ed50]
9: (()+0xb37b2) [0x7f4eb5b3a7b2]
10: (()+0x1e83eb) [0x7f4eb5c6f3eb]
11: (()+0x1ab54a) [0x7f4eb5c3254a]
12: (main()+0x9da) [0x7f4eb5c72a3a]
13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd]
14: (()+0x710b9) [0x7f4eb5af80b9]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
terminate called after

Traceback (most recent call last):
  File /usr/share/virt-manager/virtManager/asyncjob.py, line 96, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File /usr/share/virt-manager/virtManager/asyncjob.py, line 117, in tmpcb
    callback(*args, **kwargs)
  File /usr/share/virt-manager/virtManager/domain.py, line 1090, in startup
    self._backend.create()
  File /usr/lib/python2.7/dist-packages/libvirt.py, line 620, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirtError: internal error Process exited while reading console log output: 
char device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In 
function 'void Thread::create(size_t)' thread 7f4eb5a65960 time 2013-01-17 
02:32:58.096437
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (()+0x2aaa8f) [0x7f4eb2de8a8f]
2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575]
3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc]
4: (()+0xa0290) [0x7f4eb5b27290]
5: (()+0x879dd) [0x7f4eb5b0e9dd]
6: (()+0x87c1b) [0x7f4eb5b0ec1b]
7: (()+0x87ae1) [0x7f4eb5b0eae1]
8: (()+0x87d50) [0x7f4eb5b0ed50]
9: (()+0xb37b2) [0x7f4eb5b3a7b2]
10: (()+0x1e83eb) [0x7f4eb5c6f3eb]
11: (()+0x1ab54a) [0x7f4eb5c3254a]
12: (main()+0x9da) [0x7f4eb5c72a3a]
13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd]
14: (()+0x710b9) [0x7f4eb5af80b9]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
terminate called after



Re: Single host VM limit when using RBD

2013-01-17 Thread Andrey Korolyov
Hi Matthew,

Seems to be a low value of /proc/sys/kernel/threads-max.
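
For example (a sketch; the number below is purely illustrative):

# check the current system-wide thread cap
sysctl kernel.threads-max
# raise it (illustrative value, adjust to taste)
sysctl -w kernel.threads-max=2000000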

On Thu, Jan 17, 2013 at 12:37 PM, Matthew Anderson
matth...@base3.com.au wrote:
 I've run into a limit on the maximum number of RBD backed VM's that I'm able 
 to run on a single host. I have 20 VM's (21 RBD volumes open) running on a 
 single host and when booting the 21st machine I get the below error from 
 libvirt/QEMU. I'm able to shut down a VM and start another in it's place so 
 there seems to be a hard limit on the amount of volumes I'm able to have 
 open.  I did some googling and the error 11 from pthread_create seems to mean 
 'resource unavailable' so I'm probably running into a thread limit of some 
 sort. I did try increasing the max_thread kernel option but nothing changed. 
 I moved a few VM's to a different empty host and they start with no issues at 
 all.

 This machine has 4 OSD's running on it in addition to the 20 VM's. Kernel 
 3.7.1. Ceph 0.56.1 and QEMU 1.3.0. There is currently 65GB of 96GB free ram 
 and no swap.

 Can anyone suggest where the limit might be or anything I can do to narrow 
 down the problem?

 Thanks
 -Matt


RE: Single host VM limit when using RBD

2013-01-17 Thread Matthew Anderson
Hi Andrey,

I did try your suggestion beforehand and it doesn't appear to fix the issue. 

[root@KVM04 ~]# cat /proc/sys/kernel/threads-max
2549635
[root@KVM04 ~]# echo 5549635 > /proc/sys/kernel/threads-max
[root@KVM04 ~]# virsh start EX03
error: Failed to start domain EX03
error: internal error Process exited while reading console log output: char 
device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In 
function 'void Thread::create(size_t)' thread 7f5ec9706960 time 2013-01-17 
16:46:50.935681
common/Thread.cc: 110: FAILED assert(ret == 0)
 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (()+0x2aaa8f) [0x7f5ec6a89a8f]
 2: (SafeTimer::init()+0x95) [0x7f5ec6973575]
 3: (librados::RadosClient::connect()+0x72c) [0x7f5ec69099dc]
 4: (()+0xa0290) [0x7f5ec97c8290]
 5: (()+0x879dd) [0x7f5ec97af9dd]
 6: (()+0x87c1b) [0x7f5ec97afc1b]
 7: (()+0x87ae1) [0x7f5ec97afae1]
 8: (()+0x87d50) [0x7f5ec97afd50]
 9: (()+0xb37b2) [0x7f5ec97db7b2]
 10: (()+0x1e83eb) [0x7f5ec99103eb]
 11: (()+0x1ab54a) [0x7f5ec98d354a]
 12: (main()+0x9da) [0x7f5ec9913a3a]
 13: (__libc_start_main()+0xfd) [0x7f5ec5755cdd]
 14: (()+0x710b9) [0x7f5ec97990b9]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
terminate called after
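
Since the system-wide cap clearly isn't the bottleneck, the next things on my 
list to check (assumptions on my part, not a confirmed diagnosis) are the 
per-user and per-process limits that apply to the qemu processes:

# RLIMIT_NPROC for the user qemu runs as (user name is a guess, adjust per distro)
su -s /bin/sh -c 'ulimit -u' qemu
# libvirt can also cap each qemu process via /etc/libvirt/qemu.conf
# (max_processes / max_files; 0 means no limit)
grep -E '^[[:space:]]*max_(processes|files)' /etc/libvirt/qemu.conf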
    


-Original Message-
From: Andrey Korolyov [mailto:and...@xdel.ru] 
Sent: Thursday, 17 January 2013 4:42 PM
To: Matthew Anderson
Cc: ceph-devel@vger.kernel.org
Subject: Re: Single host VM limit when using RBD

Hi Matthew,

Seems to a low value in /proc/sys/kernel/threads-max value.

On Thu, Jan 17, 2013 at 12:37 PM, Matthew Anderson matth...@base3.com.au 
wrote:
 I've run into a limit on the maximum number of RBD backed VM's that I'm able 
 to run on a single host. I have 20 VM's (21 RBD volumes open) running on a 
 single host and when booting the 21st machine I get the below error from 
 libvirt/QEMU. I'm able to shut down a VM and start another in it's place so 
 there seems to be a hard limit on the amount of volumes I'm able to have 
 open.  I did some googling and the error 11 from pthread_create seems to mean 
 'resource unavailable' so I'm probably running into a thread limit of some 
 sort. I did try increasing the max_thread kernel option but nothing changed. 
 I moved a few VM's to a different empty host and they start with no issues at 
 all.

 This machine has 4 OSD's running on it in addition to the 20 VM's. Kernel 
 3.7.1. Ceph 0.56.1 and QEMU 1.3.0. There is currently 65GB of 96GB free ram 
 and no swap.

 Can anyone suggest where the limit might be or anything I can do to narrow 
 down the problem?

 Thanks
 -Matt

Re: flashcache

2013-01-17 Thread Joseph Glanville
On 17 January 2013 20:46, Gandalf Corvotempesta
gandalf.corvotempe...@gmail.com wrote:
 2013/1/16 Mark Nelson mark.nel...@inktank.com:

 I don't know if I have to use a single two port IB card (switch
 redundancy and no card redundancy) or
 I have to use two single port cards. (or a single one port IB?)

On the topic of IB..

But slightly off-topic all the same... I would love to attempt getting
Ceph running on rsockets if I could find the time (alas, we don't run
Ceph). rsockets is a fully userland implementation of BSD sockets over RDMA,
supporting fork and all the usual goodies. In theory, unless you are
using the kernel RBD module (or the kernel FS module, etc.) you should
be able to run it on rsockets and enjoy a considerable performance
increase.

rsockets is available in the librdmacm git up on OpenFabrics, and dev
+ support happens on the linux-rdma list.
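
For anyone who wants to experiment, the usual route is the preload shim that 
ships with librdmacm, so an unmodified binary picks up rsockets instead of 
regular sockets. A sketch only (the library path varies by distro, and I have 
not tried this with ceph):

# run an OSD with its socket calls intercepted by rsockets
LD_PRELOAD=/usr/lib/rsocket/librspreload.so ceph-osd -i 0 -c /etc/ceph/ceph.conf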

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846


Re: flashcache

2013-01-17 Thread Mark Nelson

On 01/16/2013 11:47 PM, Stefan Priebe - Profihost AG wrote:

Hi Mark,

Am 16.01.2013 um 22:53 schrieb Mark

With only 2 SSDs for 12 spinning disks, you'll need to make sure the SSDs are 
really fast.  I use Intel 520s for testing which are great, but I wouldn't use 
them in  production.


Why not? I use them for a ssd only ceph cluster.

Stefan


It's pretty tough to get an apples-to-apples comparison of endurance 
when looking at the Intel 520 vs something like the DC S3700. If I were 
actually building out a production deployment I'd probably stick with 
the DC S3700 (especially if sticking journals, flashcache, and XFS 
journals for 6 OSDs on 1 drive!). There's probably a reasonable 
endurance-per-cost argument for a severely under-subscribed 520 (or 
other similar drive) as well. It'd be an interesting study to look at 
how long it takes a small enterprise drive to die vs a larger 
under-subscribed consumer drive.


--
Mark Nelson
Performance Engineer
Inktank


Re: flashcache

2013-01-17 Thread Mark Nelson

On 01/17/2013 07:32 AM, Joseph Glanville wrote:

On 17 January 2013 20:46, Gandalf Corvotempesta
gandalf.corvotempe...@gmail.com  wrote:

2013/1/16 Mark Nelsonmark.nel...@inktank.com:



I don't know if I have to use a single two port IB card (switch
redundancy and no card redundancy) or
I have to use two single port cards. (or a single one port IB?)


On the topic of IB..

But slightly off-topic all the same..  I would love to attempt getting
Ceph running on rsockets if I could find the time (alas we don't run
Ceph).
rsockets is a fully userland implementation of BSD sockets over RDMA,
supporting fork and all the usual goodies, in theory unless you are
using the kernel RBD module (of the kernel FS module etc) you should
be able to run it on rsockets and enjoy a considerable performance
increase.

rsockets is available in the librdmacm git up on Open Fabrics and dev
+ support happens on the linux-rdma list.



There's been some talk about rsockets on the list before. I think there 
are a couple of different folks that have tried (succeeded?) in getting 
it working. Barring that, it sounds like if you tune interrupt affinity 
settings and various other bits you can get IPoIB up into the 2GB/s+ 
range, which, while not RDMA speed, is at least better than 10GbE.


--
Mark Nelson
Performance Engineer
Inktank


Hit suicide timeout after adding new osd

2013-01-17 Thread Jens Kristian Søgaard

Hi guys,

I had a functioning Ceph system that reported HEALTH_OK. It was running 
with 3 osds on 3 servers.


Then I added an extra osd on 1 of the servers using the commands from 
the documentation here:


http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

Shortly after I did that 2 of the existing osds crashed.

I restarted them and after some hours they were up and running again, 
but soon one of them crashed again - and a third existing osd crashed as 
well. I restarted those two and waited some hours for them to come up. A 
short while later one of them crashed again.


I have then restarted that last one and watched the logs 
closely. It seems the same pattern repeats itself every time. It starts 
up doing its normal maintenance before going up (which takes a long while). 
Then it seems to be running, but logs the following every 5 seconds:


heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed 
out after 30


After some time it logs:

===
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide 
timed out after 300


2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In 
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, 
const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654

common/HeartbeatMap.cc: 78: FAILED assert(0 == hit suicide timeout)

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, 
long)+0x2eb) [0x8462bb]

 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8]
 4: (CephContextServiceThread::entry()+0x55) [0x8e01c5]
 5: /lib64/libpthread.so.0() [0x360de07d14]
 6: (clone()+0x6d) [0x360d6f167d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.


2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) **
 in thread 7f053f149700

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: /usr/bin/ceph-osd() [0x82ea90]
 2: /lib64/libpthread.so.0() [0x360de0efe0]
 3: (gsignal()+0x35) [0x360d635925]
 4: (abort()+0x148) [0x360d6370d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad]
 NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.

===

How can I avoid this? - is it a bug, or have I done something wrong?

I'm running Ceph 0.56.1 from the official RPMs on Fedora 17.
The underlying disks and network connectivity have been tested and 
nothing seems to be wrong there.


Thanks in advance for your assistance!
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/


Re: flashcache

2013-01-17 Thread Atchley, Scott
On Jan 17, 2013, at 8:37 AM, Mark Nelson mark.nel...@inktank.com wrote:

 On 01/17/2013 07:32 AM, Joseph Glanville wrote:
 On 17 January 2013 20:46, Gandalf Corvotempesta
 gandalf.corvotempe...@gmail.com  wrote:
 2013/1/16 Mark Nelsonmark.nel...@inktank.com:
 
 I don't know if I have to use a single two port IB card (switch
 redundancy and no card redundancy) or
 I have to use two single port cards. (or a single one port IB?)
 
 On the topic of IB..
 
 But slightly off-topic all the same..  I would love to attempt getting
 Ceph running on rsockets if I could find the time (alas we don't run
 Ceph).
 rsockets is a fully userland implementation of BSD sockets over RDMA,
 supporting fork and all the usual goodies, in theory unless you are
 using the kernel RBD module (of the kernel FS module etc) you should
 be able to run it on rsockets and enjoy a considerable performance
 increase.
 
 rsockets is available in the librdmacm git up on Open Fabrics and dev
 + support happens on the linux-rdma list.
 
 
 There's been some talk about rsockets on the list before.  I think there 
 are a couple of different folks that have tried (succeeded?) in getting 
 it working.  barring that, it sounds like if you tune interrupt affinity 
 settings and various other bits you can get IPoIB up into the 2GB/s+ 
 range which while not RDMA speed, is at least better than 10GbE.

IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR 
PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it 
uses the traditional socket stack through the kernel, CPU usage will be as high 
as (or, for QDR, higher than) with 10GbE.
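
For reference, the measurement itself was nothing fancy, roughly the following 
over the IPoIB interface (the address and message size here are illustrative):

# on the receiving node
netserver
# on the sending node: 30-second TCP stream test against the IPoIB address
netperf -H 192.168.100.2 -t TCP_STREAM -l 30 -- -m 65536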

I would be interested in seeing if rsockets helps ceph. I have questions though.

By default, rsockets still has to copy data in and out. It has extensions for 
zero-copy, but they do not work with non-blocking sockets. Does ceph use 
non-blocking sockets?

How many simultaneous connections does it support before falling over into the 
non-rsockets path?

If rsockets has a ceiling on connections and falls over to the non-socket path, 
can the application determine if it is safe to use the zero-copy extensions for 
a specific connection or do they fail gracefully?

Scott


Re: Hit suicide timeout after adding new osd

2013-01-17 Thread Wido den Hollander

Hi,

On 01/17/2013 03:35 PM, Jens Kristian Søgaard wrote:

Hi guys,

I had a functioning Ceph system that reported HEALTH_OK. It was running
with 3 osds on 3 servers.

Then I added an extra osd on 1 of the servers using the commands from
the documentation here:

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

Shortly after I did that 2 of the existing osds crashed.

I restarted them and after some hours they were up and running again,
but soon one of them crashed again - and a third existing osd crashed as
well. I restarted those two and waited some hours for them to come up. A
short while later one of them crashed again.

I have then restarted restarted that last one and watched the logs
closely. It seems the same patterns repeats itself every time. It starts
up doing its normal maintenance before going up (takes a long while).
Then it seems to be running, but logs the following every 5 seconds:

heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed
out after 30

After some time it logs:

===
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide
timed out after 300

2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654
common/HeartbeatMap.cc: 78: FAILED assert(0 == hit suicide timeout)

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
long)+0x2eb) [0x8462bb]
  2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e]
  3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8]
  4: (CephContextServiceThread::entry()+0x55) [0x8e01c5]
  5: /lib64/libpthread.so.0() [0x360de07d14]
  6: (clone()+0x6d) [0x360d6f167d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) **
  in thread 7f053f149700

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: /usr/bin/ceph-osd() [0x82ea90]
  2: /lib64/libpthread.so.0() [0x360de0efe0]
  3: (gsignal()+0x35) [0x360d635925]
  4: (abort()+0x148) [0x360d6370d8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.
===

How can I avoid this? - is it a bug, or have I done something wrong?



I think you are seeing the same issue as I noticed about two weeks ago: 
http://www.spinics.net/lists/ceph-devel/msg11328.html


See this issue: http://tracker.newdream.net/issues/3714

I can't find branch wip-3714 anymore, so it might already have been merged into 
next.


You might want to try building from 'next' yourself or fetching some new 
packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/


Wido



I'm running Ceph 0.56.1 from the official RPMs on Fedora 17.
The underlying disks and network connectivity has been tested and
nothing seems to be wrong there.

Thanks in advance for your assistance!




Re: flashcache

2013-01-17 Thread Andrey Korolyov
On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott atchle...@ornl.gov wrote:
 On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta 
 gandalf.corvotempe...@gmail.com wrote:

 2013/1/17 Atchley, Scott atchle...@ornl.gov:
 IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR 
 PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since 
 it uses the traditional socket stack through the kernel, CPU usage will be 
 as high (or higher if QDR) than 10GbE.

 Which kind of tuning? Do you have a paper about this?

 No, I followed the Mellanox tuning guide and modified their interrupt 
 affinity scripts.

Did you try binding interrupts only to the core to which the QPI link
actually belongs, and measure the difference against a spread-over-all-cores
binding?


 But, actually, is possible to use ceph with IPoIB in a stable way or
 is this experimental ?

 IPoIB appears as a traditional Ethernet device to Linux and can be used as 
 such.

Not exactly; this summer the kernel gained an additional driver for a fully
featured L2 (an IB Ethernet driver). Before that it was quite painful to
do any kind of failover using IPoIB.


 I don't know if i support for rsocket that is experimental/untested
 and IPoIB is a stable workaroud or what else.

 IPoIB is much more used and pretty stable, while rsockets is new with limited 
 testing. That said, more people using it will help Sean improve it.

 Ideally, we would like support for zero-copy and reduced CPU usage (via 
 OS-bypass) and with more interconnects than just InfiniBand. :-)

 And is a dual controller needed on each OSD node? Ceph is able to
 handle OSD network failures? This is really important to know. It
 change the whole network topology.

 I will let others answer this.

 Scott


Re: Hit suicide timeout after adding new osd

2013-01-17 Thread Wido den Hollander

Hi,

On 01/17/2013 03:50 PM, Stefan Priebe wrote:

Hi,

Am 17.01.2013 15:47, schrieb Wido den Hollander:

You might want to try building from 'next' yourself or fetch some new
packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/


Should it be backported to bobtail branch as well?



Don't think it's in there. I found the commit in the next branch: 
https://github.com/ceph/ceph/commit/7e94f6f1a7b7a865433edacd6a521f6ea1170eac


Doesn't seem to be in 'bobtail'.
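
A quick way to check from a clone (using the commit id above):

# list the remote branches that already contain the fix
git fetch origin
git branch -r --contains 7e94f6f1a7b7a865433edacd6a521f6ea1170eac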

Wido


Stefan


Re: Hit suicide timeout after adding new osd

2013-01-17 Thread Stefan Priebe

Hi Sage,

Am 17.01.2013 16:33, schrieb Wido den Hollander:

Hi,
On 01/17/2013 03:50 PM, Stefan Priebe wrote:

Hi,
Am 17.01.2013 15:47, schrieb Wido den Hollander:

You might want to try building from 'next' yourself or fetch some new
packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/


Should it be backported to bobtail branch as well?



Don't think it's in there. I found the commit in the next branch:
https://github.com/ceph/ceph/commit/7e94f6f1a7b7a865433edacd6a521f6ea1170eac

Doesn't seem to be in 'bobtail'.

Wido


Was this forgotten? I mean, a lot of people have posted about the hit suicide 
timeout with 0.56 or 0.56.1, so shouldn't it also be in the bobtail branch?


Greets,
Stefan


Re: flashcache

2013-01-17 Thread Atchley, Scott
On Jan 17, 2013, at 10:07 AM, Andrey Korolyov and...@xdel.ru wrote:

 On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott atchle...@ornl.gov wrote:
 On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta 
 gandalf.corvotempe...@gmail.com wrote:
 
 2013/1/17 Atchley, Scott atchle...@ornl.gov:
 IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR 
 PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. 
 Since it uses the traditional socket stack through the kernel, CPU usage 
 will be as high (or higher if QDR) than 10GbE.
 
 Which kind of tuning? Do you have a paper about this?
 
 No, I followed the Mellanox tuning guide and modified their interrupt 
 affinity scripts.
 
 Did you tried to bind interrupts only to core to which QPI link
 belongs in reality and measure difference with spread-over-all-cores
 binding?

This is the modified part. I bound the mlx4-async handler to core 0 and the 
mlx4-ib-1-0 handler to core 1 for our machines.
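
Roughly like this (a sketch; the IRQ numbers below are examples, the real ones 
come from /proc/interrupts on each box):

# find the mlx4 interrupt lines
grep mlx4 /proc/interrupts
# smp_affinity takes a hex CPU mask: 1 = core 0, 2 = core 1
echo 1 > /proc/irq/72/smp_affinity   # mlx4-async (example IRQ number)
echo 2 > /proc/irq/73/smp_affinity   # mlx4-ib-1-0 (example IRQ number)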

 But, actually, is possible to use ceph with IPoIB in a stable way or
 is this experimental ?
 
 IPoIB appears as a traditional Ethernet device to Linux and can be used as 
 such.
 
 Not exactly, this summer kernel added additional driver for fully
 featured L2(ib ethernet driver), before that it was quite painful to
 do any possible failover using ipoib.

I assume it is now an EoIB driver. Does it replace the IPoIB driver?

 I don't know if i support for rsocket that is experimental/untested
 and IPoIB is a stable workaroud or what else.
 
 IPoIB is much more used and pretty stable, while rsockets is new with 
 limited testing. That said, more people using it will help Sean improve it.
 
 Ideally, we would like support for zero-copy and reduced CPU usage (via 
 OS-bypass) and with more interconnects than just InfiniBand. :-)
 
 And is a dual controller needed on each OSD node? Ceph is able to
 handle OSD network failures? This is really important to know. It
 change the whole network topology.
 
 I will let others answer this.
 
 Scott


Re: flashcache

2013-01-17 Thread Atchley, Scott
On Jan 17, 2013, at 10:14 AM, Gandalf Corvotempesta 
gandalf.corvotempe...@gmail.com wrote:

 2013/1/17 Atchley, Scott atchle...@ornl.gov:
 IPoIB appears as a traditional Ethernet device to Linux and can be used as 
 such. Ceph has no idea that it is not Ethernet.
 
 Ok. Now it's clear.
 AFAIK, a standard SDR IB card should give us more speed than GbE
 (less overhead?) and lower latency, I think.

Yes. It should get close to 1 GB/s, where 1GbE is limited to about 125 MB/s. 
Lower latency? Probably, since most Ethernet drivers set interrupt coalescing by 
default. Intel's e1000 driver, for example, has a cluster mode that reduces (or 
turns off) interrupt coalescing. I don't know if ceph is latency sensitive or 
not.

Scott


Current OSD weight vs. target weight

2013-01-17 Thread Christopher Kunz
Hi,

If run during a user-issued reweight (i.e. ceph osd crush reweight x
y), ceph osd tree shows the target weight of an OSD. Is there a way
to see the *current* weight of the OSD?

We would like to be able to approximate the amount of rollback necessary
per OSD if we need to cancel a larger reweight at some point. This seems
impossible, though, without knowing the current weight.

Any hints how to do this?

Regards,

--ck

-- 
filoo GmbH
Dr. Christopher Kunz

E-Mail: ch...@filoo.de
Tel.: (+49) 0 52 41 8 67 30 -18
Fax: (+49) 0 52 41 / 8 67 30 -20

Please sign  encrypt mail wherever possible, my key:
C882 8ED1 7DD1 9011 C088 EA50 5CFA 2EEB 397A CAC1

Moltkestraße 25a
0 Gütersloh

HRB4355, AG Gütersloh
Geschäftsführer: S.Grewing, J.Rehpöhler, Dr. C.Kunz

Filoo im Web: http://www.filoo.de/
Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
Werden Sie unser Fan auf Facebook: http://facebook.com/filoogmbh


Re: flashcache

2013-01-17 Thread Atchley, Scott
On Jan 17, 2013, at 11:01 AM, Gandalf Corvotempesta 
gandalf.corvotempe...@gmail.com wrote:

 2013/1/17 Atchley, Scott atchle...@ornl.gov:
 Yes. It should get close to 1 GB/s where 1GbE is limited to about 125 MB/s. 
 Lower latency? Probably since most Ethernet drivers set interrupt coalescing 
 by default. Intel e1000 driver, for example, have a cluster mode that 
 reduces (or turns off) interrupt coalescing. I don't know if ceph is latency 
 sensitive or not.
 
 Sorry, I meant 10GbE.

10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again 
depends on the Ethernet driver.

Scott


Re: flashcache

2013-01-17 Thread Stefan Priebe

Hi,

Am 17.01.2013 17:12, schrieb Atchley, Scott:

On Jan 17, 2013, at 11:01 AM, Gandalf Corvotempesta 
gandalf.corvotempe...@gmail.com wrote:


2013/1/17 Atchley, Scott atchle...@ornl.gov:

Yes. It should get close to 1 GB/s where 1GbE is limited to about 125 MB/s. 
Lower latency? Probably since most Ethernet drivers set interrupt coalescing by 
default. Intel e1000 driver, for example, have a cluster mode that reduces (or 
turns off) interrupt coalescing. I don't know if ceph is latency sensitive or 
not.


Sorry, I meant 10GbE.


10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again 
depends on the Ethernet driver.


We're using bonded active/active 2x10GbE with Intel ixgbe, and I'm able 
to get 2.3GB/s.


Not sure how to measure latency effectively.
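
One option (a sketch only, not something I have verified on this setup) is a 
request/response test, which gives transactions per second and therefore an idea 
of round-trip latency:

# 1-byte request/response over the bonded link (address is illustrative)
netperf -H 10.0.0.2 -t TCP_RR -l 30
# qperf reports latency directly, if installed:
# qperf 10.0.0.2 tcp_lat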

Stefan


Re: flashcache

2013-01-17 Thread Stefan Priebe

Hi,

Am 17.01.2013 17:21, schrieb Gandalf Corvotempesta:

2013/1/17 Stefan Priebe s.pri...@profihost.ag:

We're using bonded active/active 2x10GbE with Intel ixgbe and i'm able to
get 2.3GB/s.


Which kind of switch do you use?


HP 5920

Stefan


RE: Ceph slow request unstable issue

2013-01-17 Thread Sage Weil
On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
 A summary of the tested cases so far:
 Ceph is v0.56.1
 
 1.RBD:Ubuntu 13.04 + 3.7Kernel 
   OSD:Ubuntu 13.04 + 3.7Kernel
   XFS
 
   Result: Kernel Panic on both RBD and OSD sides

We're very interested in the RBD client-side kernel panic!  I don't think 
there are known issues with 3.7.

 2.RBD:Ubuntu 13.04 +3.2Kernel
   OSD:Ubuntu 13.04 +3.2Kernel
   XFS
   
   Result:Kernel Panic on RBD( ~15Minus)

This less so; we've only backported fixes as far as 3.4.

 3.RBD:Ubuntu 13.04 + 3.6.7 Kernel (Suggested by Ceph.com)
   OSD:Ubuntu 13.04 + 3.2   Kernel 
   XFS
 
   Result: Auto-reset on OSD ( ~ 30 mins after the test started)
 
 4.RBD:Ubuntu 13.04+3.6.7 Kernel (Suggested by Ceph.com)
   OSD:Ubuntu 12.04 + 3.2.0-36 Kernel (Suggested by Ceph.com)
   XFS
   
   Result:auto-reset on OSD ( ~ 30 mins after the test started)

Are these the weird exit_mm traces shown below?

 5.RBD:Ubuntu 13.04+3.6.7 Kernel (Suggested by Ceph.com)
   OSD:Ubuntu 13.04 +3.6.7 (Suggested by Sage)
   XFS
 
   Result: seems stable for last 1 hour, still running till now

Eager to hear how this goes.

Thanks!
sage


 
 
 Tests 3 and 4 are repeatable.
 My test setup:
 OSD side:
   3 nodes, 60 disks (20 per node, 1 per OSD), 10GbE, 4 Intel 520 SSDs per node 
 as journal, XFS.
   Each node uses 2 * Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz + 128GB RAM.
 RBD side:
   8 nodes; each node has 10GbE, 2 * Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 
 128GB RAM.
 
 Method:
   Create 240 RBDs and mount them on 8 nodes (30 RBDs per node), then run dd 
 concurrently on all 240 RBDs.
 
   After ~30 minutes, one of the OSD nodes is likely to reset.
 
 Ceph OSD logs, syslog and dmesg from the reset node are available if you 
 need them. (It looks to me like there is no valuable information there, except 
 a lot of slow-request warnings in the OSD logs.)
 
 
   
 Xiaoxi
 
 
 -Original Message-
 From: Sage Weil [mailto:s...@inktank.com] 
 Sent: 2013-01-17 10:35
 To: Chen, Xiaoxi
 Subject: RE: Ceph slow request  unstable issue
 
 On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
  No, on the OSD node, not the same node. OSD node with 3.2 kernel while 
  client node with 3.6 kernel
  
  We did suffer kernel panics on RBD client nodes, but after upgrading the 
  client kernel to 3.6.6 it seems solved.
 
 Is it easy to try the 3.6 kernel on the osd nodes too?
 
 
  
  
  -Original Message-
  From: Sage Weil [mailto:s...@inktank.com]
  Sent: 2013-01-17 10:17
  To: Chen, Xiaoxi
  Subject: RE: Ceph slow request  unstable issue
  
  On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
    It is easy to reproduce in my setup...
    Once I have a high enough load on it and wait for tens of minutes, I can 
    see such logs.
    As a forecast, slow requests of more than 30~60s are frequently present 
    in the ceph-osd logs.
  
  Just replied to your other email.  Do I understand correctly that you are 
  seeing this problem on the *rbd client* nodes?  Or also on the OSDs?  Are 
  they the same nodes?
  
  sage
  
   
   -Original Message-
   From: Sage Weil [mailto:s...@inktank.com]
    Sent: 2013-01-17 0:59
   To: Andrey Korolyov
   Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
   Subject: Re: Ceph slow request  unstable issue
   
   Hi,
   
   On Wed, 16 Jan 2013, Andrey Korolyov wrote:
On Wed, Jan 16, 2013 at 4:58 AM, Chen, Xiaoxi xiaoxi.c...@intel.com 
wrote:
 Hi list,
 We are suffering OSDs, or the whole OS, going down when there is continued 
 high pressure on the Ceph rack.
 Basically we are on Ubuntu 12.04 + Ceph 0.56.1, 6 nodes, each node with 
 20 spindles + 4 SSDs as journal (120 spindles in total).
 We create a lot of RBD volumes (say 240), mount them on 16 
 different client machines (15 RBD volumes per client) and run dd 
 concurrently on top of each RBD.

 The issues are:
 1. Slow requests
 ??From the list-archive it seems solved in 0.56.1 but we still 
 notice such warning 2. OSD Down or even host down Like the 
 message below.Seems some OSD has been blocking there for quite a long 
 time.

 Suggestions are highly appreciate.Thanks
   
   
 
 Xiaoxi

 _

 Bad news:

 I have rolled back all my Ceph machines' OS to kernel 3.2.0-23, which 
 Ubuntu 12.04 uses.
 I ran a dd command (dd if=/dev/zero bs=1M count=6 of=/dev/rbd${i}) on the 
 Ceph clients last night to prepare test data.
 Now, I have one machine down (can't be reached by ping), 

Re: HOWTO: teuthology and code coverage

2013-01-17 Thread Gregory Farnum
It's great to see people outside of Inktank starting to get into using
teuthology. Thanks for the write-up!
-Greg

On Wed, Jan 16, 2013 at 6:01 AM, Loic Dachary l...@dachary.org wrote:
 Hi,

 I'm happy to report that running teuthology to get a lcov code coverage 
 report worked for me.

 http://dachary.org/wp-uploads/2013/01/teuthology/total/mon/Monitor.cc.gcov.html

 It took me a while to figure out the logic (thanks Josh for the help :-). I 
 wrote a HOWTO explaining the steps in detail. It should be straightforward to 
 run on an OpenStack tenant, using virtual machines instead of bare metal.

 http://dachary.org/?p=1788

 Cheers



master branch issue in ceph.git

2013-01-17 Thread David Zafman

The latest code is hanging trying to start teuthology.  I used teuthology-nuke 
to clear old state and reboot the machines.  I was using my branch rebased to 
latest master and when that started failing I switched to the default config.  
It still keeps hanging here:

INFO:teuthology.task.ceph:Waiting until ceph is healthy...

$ ceph -s
   health HEALTH_WARN 5 pgs degraded; 108 pgs stuck unclean
   monmap e1: 3 mons at 
{0=10.214.131.23:6789/0,1=10.214.131.21:6789/0,2=10.214.131.20:6789/0}, 
election epoch 6, quorum 0,1,2 0,1,2
   osdmap e7: 9 osds: 9 up, 9 in
pgmap v25: 108 pgs: 103 active+remapped, 5 active+degraded; 0 bytes data, 
798 GB used, 3050 GB / 4055 GB avail
   mdsmap e2: 0/0/0 up

David Zafman
Senior Developer
david.zaf...@inktank.com





Re: master branch issue in ceph.git

2013-01-17 Thread Sage Weil
On Thu, 17 Jan 2013, David Zafman wrote:
 
 The latest code is hanging trying to start teuthology.  I used 
 teuthology-nuke to clear old state and reboot the machines.  I was using my 
 branch rebased to latest master and when that started failing I switched to 
 the default config.  It still keeps hanging here:
 
 INFO:teuthology.task.ceph:Waiting until ceph is healthy...
 
 $ ceph -s
health HEALTH_WARN 5 pgs degraded; 108 pgs stuck unclean
monmap e1: 3 mons at 
 {0=10.214.131.23:6789/0,1=10.214.131.21:6789/0,2=10.214.131.20:6789/0}, 
 election epoch 6, quorum 0,1,2 0,1,2
osdmap e7: 9 osds: 9 up, 9 in
 pgmap v25: 108 pgs: 103 active+remapped, 5 active+degraded; 0 bytes data, 
 798 GB used, 3050 GB / 4055 GB avail
mdsmap e2: 0/0/0 up

git pull the latest teuthology.git master.  By default the crush map 
separates replicas across hosts now, and teuthology needed to be updated 
to do it by osd instead.
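
For anyone hitting the same thing outside teuthology, the ceph.conf knob behind 
this behaviour can be set before the cluster is created; a sketch only (option 
name taken from the current docs, so double-check it against your version):

# spread replicas across OSDs instead of hosts (0 = osd, 1 = host)
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
    osd crush chooseleaf type = 0
EOF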

Sorry!

sage