Re: Bug #1047 reproduced

2011-12-05 Thread Amon Ott
On Friday 02 December 2011 Sage Weil wrote: > On Fri, 2 Dec 2011, Amon Ott wrote: > > On Thursday 01 December 2011 you wrote: > > > On all four nodes of my test cluster, MDS crashes with a trace like > > > that in bug #1047. Example and ceph.conf attached. Ceph server side is > > > from git master,

Re: os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)

2011-12-05 Thread Martin Mailand
Hi Sage, it happened again, this time I have the log, it's attached. (gdb) thread 1 [Switching to thread 1 (Thread 24077)]#0 0x7f7995b83b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) frame 11 #11 0x0072ee8d in FileJournal::committed_thru (this=0x1ebc000, seq=1683

Re: Cluster sync doesn't finish

2011-12-05 Thread Martin Mailand
Hi Sam, is there anything new on this issue that I could test? -martin On 19.11.2011 02:05, Samuel Just wrote: I've inserted this bug as #1738. Unfortunately, this will take a bit of effort to fix. In the short term, you could switch to a crushmap where each node at the bottom level of t

'ceph -w' in 0.39

2011-12-05 Thread Christian Brunner
I've just updated to 0.39. Everything seems to be fine, except one minor thing I noticed: 'ceph -w' output stops after a few minutes. With "debug ms = 1" it ends with these lines: 2011-12-05 14:45:52.939300 7fc700637700 -- 10.255.0.21:0/14145 <== mon.2 10.255.0.22:6789/0 315 mon_observe_noti

Cannot create an rbd image

2011-12-05 Thread Guido Winkelmann
Hi, I cannot create a new rbd image in a newly created ceph cluster. When I try it, I get these messages: # rbd create testimage1 --size 1 2011-12-05 17:41:00.987352 7ffe8b5c3760 -- :/0 messenger.start 2011-12-05 17:41:00.987542 7ffe8b5c3760 -- :/1011427 --> 10.3.1.33:6789/0 -- auth(proto 0

Re: os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)

2011-12-05 Thread Sage Weil
dc167bac7800c75df971bded4b54e0de48f7b18f (wip-journal branch) should fix this. Can you give it a test before I push to stable? Thanks! sage On Mon, 5 Dec 2011, Martin Mailand wrote: > Hi Sage, > it happened again, this time I have the log, it's attached. > > (gdb) thread 1 > [Switching to th

Re: Cannot create an rbd image

2011-12-05 Thread Josh Durgin
On 12/05/2011 08:57 AM, Guido Winkelmann wrote: Hi, I cannot create a new rbd image in a newly created ceph cluster. When I try it, I get these messages: # rbd create testimage1 --size 1 2011-12-05 17:41:00.994075 7ffe8b5c3760 librbd: failed to assign a block name for image create erro

Re: Cannot create an rbd image

2011-12-05 Thread Guido Winkelmann
On Monday, 5 December 2011, 09:57:22, Josh Durgin wrote: > On 12/05/2011 08:57 AM, Guido Winkelmann wrote: > > Hi, > > > > I cannot create a new rbd image in a newly created ceph cluster. When I > > try it, I get these messages: > > > > # rbd create testimage1 --size 1 > > > > > 2011-12-

Re: os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)

2011-12-05 Thread Martin Mailand
Hi Sage, I just updated the crashed osd, and it did not work very well. os/FileJournal.cc: 1173: FAILED assert(h->seq >= last_committed_seq) 1173 os/FileJournal.cc: No such file or directory. in os/FileJournal.cc (gdb) p h->seq value has been optimized out (gdb) p last_committed_seq $

Re: Client receives 'connection refused' only after heavy use

2011-12-05 Thread Noah Watkins
On 12/04/2011 08:47 PM, Sage Weil wrote: It would be interesting to see what `ls -al /proc/$pid/fd` looks like after the process has been running for a while... there is probably a leak somewhere. Here is the proc//fd contents. These contents should reflect an idle cluster with no active c
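The `/proc/$pid/fd` check suggested in this message can be summarized with a short script. What follows is a minimal sketch and not part of the original thread: it assumes a Linux `/proc` filesystem, and the function names `classify_fd_target`, `count_fd_kinds`, and `read_fd_targets` are hypothetical helpers introduced only for illustration. It groups the symlink targets of a process's open descriptors by kind, so a socket count that keeps growing on an idle daemon stands out.

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Map one /proc/<pid>/fd symlink target to a coarse kind."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    # Everything else is treated as a regular path (file, device, directory).
    return "file"


def count_fd_kinds(targets):
    """Count descriptor kinds for an iterable of symlink targets."""
    return Counter(classify_fd_target(t) for t in targets)


def read_fd_targets(pid):
    """Read the symlink targets under /proc/<pid>/fd (Linux only)."""
    fd_dir = "/proc/{}/fd".format(pid)
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

Calling `count_fd_kinds(read_fd_targets(1394))` as root, and repeating the call after the daemon has run for a while, would show whether the number of sockets keeps increasing. Reading another user's `/proc/<pid>/fd` generally requires matching privileges.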

Re: Client receives 'connection refused' only after heavy use

2011-12-05 Thread Tommi Virtanen
On Mon, Dec 5, 2011 at 10:57, Noah Watkins wrote: > Here is the proc//fd contents. These contents should reflect an > idle cluster with no active clients. ... > root@issdm-23:~# ls -la /proc/1394/fd/ > lrwx-- 1 root root 64 2011-12-05 10:54 10 -> socket:[40002] Can we get a "sudo lsof -n -p 1

Re: Client receives 'connection refused' only after heavy use

2011-12-05 Thread Noah Watkins
On 12/05/2011 11:03 AM, Tommi Virtanen wrote: Can we get a "sudo lsof -n -p 1394" too? That would tell a little bit more about what the sockets are actually for.. root@issdm-23:~# lsof -n -p 1394 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ceph-mds 1394 root cwd

Re: Client receives 'connection refused' only after heavy use

2011-12-05 Thread Sage Weil
Hmm, the other bit that would be useful would be a log with 'debug ms = 20' that is associated with these dumps. Then we can correlate the unused sockets with the log and see where they came from. Thanks! sage On Mon, 5 Dec 2011, Noah Watkins wrote: > On 12/05/2011 11:03 AM, Tommi Virtanen w

Re: os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)

2011-12-05 Thread Sage Weil
The second fix is pushed to stable branch. Thanks for testing! (FWIW this is an old bug, so no need to rush to upgrade unless you're actually hitting it.) sage On Mon, 5 Dec 2011, Martin Mailand wrote: > Hi Sage, > I just updated the crashed osd, and it did not work very well. > > os/FileJo

Re: Client receives 'connection refused' only after heavy use

2011-12-05 Thread Tommi Virtanen
On Mon, Dec 5, 2011 at 11:05, Noah Watkins wrote: > root@issdm-23:~# lsof -n -p 1394 > COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME > ceph-mds 1394 root   12u  sock                0,6      0t0   11560 can't > identify protocol There's 506 of these guys in the listing. I

Re: [PATCH/RFC 0/6]: Introduction

2011-12-05 Thread Tommi Virtanen
On Mon, Nov 28, 2011 at 06:04, Andre Noll wrote: > Here is what I have so far. This patch set imports the documentation > of the OSD subcommands from the wiki to the previously empty file > doc/ops/monitor.rst of the git repo. The first patch is just the > result of a cut & paste operation of the

Re: 'ceph -w' in 0.39

2011-12-05 Thread Sage Weil
Hi Christian, On Mon, 5 Dec 2011, Christian Brunner wrote: > I've just updated to 0.39. Everything seems to be fine, except one > minor thing I noticed: > > 'ceph -w' output stops after a few minutes. With "debug ms = 1" it > ends with these lines: > > 2011-12-05 14:45:52.939300 7fc700637700 --

Fwd: Ceph Best Practices

2011-12-05 Thread Steven Crothers
Hello, I have a quick question about the way to lay out disks in the OSDs. I can't really find the information I'm looking for on the Wiki or in the existing mailing list archives that I have saved (4-5 months' worth). In the event a large OSD is being built (say 20 drives, same make/model 2TB). How