Re: [BUG] rbd discard should return OK even if rbd file does not exist

2012-11-19 Thread Stefan Priebe - Profihost AG

Hi Josh,

I got the following info from the qemu devs.

The discards get canceled by the client kernel because they take too long.
This happens because ceph handles discards as buffered I/O.

I see that there are up to 800 pending requests, and rbd only returns
success once no requests are left. That is too long for the kernel.

I think discards must be changed to unbuffered I/O to solve this.

Greets,
Stefan

Am 18.11.2012 03:38, schrieb Josh Durgin:

On 11/17/2012 02:19 PM, Stefan Priebe wrote:

Hello list,

right now librbd returns an error if I issue a discard for a sector /
byte range where ceph does not have any file, because I never wrote to
that section.


Thanks for bringing this up again. I haven't had time to dig deeper
into it yet, but I definitely want to fix this for bobtail.


This is not correct. It should return 0 / OK in this case.

Stefan

Example log:
2012-11-02 21:06:17.649922 7f745f7fe700 20 librbd::AioRequest: WRITE_FLAT
2012-11-02 21:06:17.649924 7f745f7fe700 20 librbd::AioCompletion:
AioCompletion::complete_request() this=0x7f72cc05bd20
complete_cb=0x7f747021d4b0
2012-11-02 21:06:17.649924 7f747015c780  1 -- 10.10.0.2:0/2028325 --
10.10.0.18:6803/9687 -- osd_op(client.26862.0:3073
rb.0.1044.359ed6c7.0bde [delete] 3.bd84636 snapc 2=[]) v4 -- ?+0
0x7f72d81c69b0 con 0x7f74600dbf50
2012-11-02 21:06:17.649934 7f747015c780 20 librbd:  oid
rb.0.1044.359ed6c7.0bdf 0~4194304 from [4156556288,4194304]
2012-11-02 21:06:17.649972 7f7465a6e700  1 -- 10.10.0.2:0/2028325 ==
osd.1202 10.10.0.18:6806/9821 143  osd_op_reply(1652
rb.0.1044.359ed6c7.0652 [delete] ondisk = -2 (No such file or
directory)) v4  130+0+0 (2964367729 0 0) 0x7f72dc0f0090 con
0x7f74600e4350
2012-11-02 21:06:17.649994 7f745f7fe700 20 librbd::AioRequest: write
0x7f74600feab0 should_complete: r = -2


This last line isn't printing what's actually being returned to the
application. It's still in librbd's internal processing, and will be
converted to 0 for the application.
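The conversion Josh describes can be sketched as follows (a hypothetical helper for illustration only, not the actual librbd code path):

```python
import errno

def discard_result(osd_return_code):
    """Map an OSD op result to what the application should see for a
    discard: deleting an object that was never written returns -ENOENT
    from the OSD, but from the application's point of view the discard
    succeeded (the range holds no data), so that case folds into 0."""
    if osd_return_code == -errno.ENOENT:
        return 0
    return osd_return_code
```

Any other error (e.g. -EIO) would still propagate to the caller unchanged.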

Could you try with the master or next branches? After the
'should_complete' line, there should be a line like:

date time thread_id 20 librbd::AioCompletion:
AioCompletion::finalize() rval 0 ...

That 'rval 0' shows the actual return value the application (qemu in
this case) will see.

Josh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] rbd discard should return OK even if rbd file does not exist

2012-11-19 Thread Stefan Priebe - Profihost AG
Sorry, I meant the building phase in this case. Building 900 requests
takes too long, so the kernel starts to cancel these I/O requests.


  void AioCompletion::finish_adding_requests(CephContext *cct)
  {
    ldout(cct, 20) << "AioCompletion::finish_adding_requests "
                   << (void*)this << " pending " << pending_count << dendl;

    lock.Lock();
    assert(building);
    building = false;
    if (!pending_count) {
      finalize(cct, rval);
      complete();
    }
    lock.Unlock();
  }

finalize() and complete() are only called once pending_count is 0, i.e.
when all I/O is done.
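The counting pattern behind that C++ can be illustrated with a minimal sketch (illustrative Python, not librbd's actual implementation):

```python
import threading

class AioCompletion:
    """Minimal sketch of a completion that fires only after the
    'building' phase has ended AND every sub-request has finished."""

    def __init__(self, on_complete):
        self.lock = threading.Lock()
        self.building = True
        self.pending_count = 0
        self.on_complete = on_complete

    def add_request(self):
        with self.lock:
            assert self.building
            self.pending_count += 1

    def complete_request(self):
        with self.lock:
            self.pending_count -= 1
            # Only fire once building is over and nothing is outstanding.
            if not self.building and self.pending_count == 0:
                self.on_complete()

    def finish_adding_requests(self):
        # Mirrors the C++ snippet: once building is done, complete
        # immediately if no sub-requests are still pending.
        with self.lock:
            self.building = False
            if self.pending_count == 0:
                self.on_complete()
```

This is why the caller sees success only after the last of the pending requests completes.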


Stefan

Am 19.11.2012 09:38, schrieb Stefan Priebe - Profihost AG:

[quoted message and mailing-list footer trimmed; identical to the message above]


Re: [BUG] rbd discard should return OK even if rbd file does not exist

2012-11-19 Thread Stefan Priebe - Profihost AG

Hi Josh,

sorry for the bunch of mails.

It turns out not to be a bug in RBD or ceph, but a bug in the Linux
kernel itself. Paolo from qemu told me the Linux kernel should serialize
these requests instead of sending the whole bunch and then hoping that
all of them get handled in milliseconds.


Stefan

Am 18.11.2012 03:38, schrieb Josh Durgin:

[quoted message and mailing-list footer trimmed; identical to the message above]


Re: [BUG] rbd discard should return OK even if rbd file does not exist

2012-11-19 Thread Stefan Priebe - Profihost AG

Strangely enough, this works fine with a normal iSCSI target... no idea why.

Stefan
Am 19.11.2012 11:15, schrieb Stefan Priebe - Profihost AG:

[quoted message and mailing-list footer trimmed; identical to the messages above]


Re: rbd tool changed format? (breaks compatibility)

2012-11-19 Thread Constantinos Venetsanopoulos

On 11/16/2012 07:14 PM, Josh Durgin wrote:

On 11/16/2012 06:36 AM, Constantinos Venetsanopoulos wrote:

Hello ceph team,

As you may already know, our team in GRNET is building a complete open
source cloud platform called Synnefo [1], which already powers our
production public cloud service ~okeanos [2].

Synnefo is using Google Ganeti for the low level VM management part [3].
As of Jan 2012, we have merged to upstream Ganeti support for VM disks
on RADOS [4].

Today we received feedback that other people trying to run Ganeti
with RADOS get an error, probably because the output of the 'rbd
showmapped' command has changed.

I'd like to ask if indeed the output format of the rbd tool has changed.
More specifically:

1. Does the 'rbd showmapped' command still return just the headers if
no device is mapped?


No


Ack.





2. Has the separator between the 'rbd showmapped' columns changed
from \t to ?


Yes, this is in the release notes for 0.54 
(http://ceph.com/docs/master/release-notes/#v0-54).




Ack.


I don't have the latest rbd tool setup (but rather
ceph-common=0.48.1argonaut-1~bpo60+1), so I can't test it right now,
but I see this commit:

https://github.com/ceph/ceph/commit/bed55369a96c2651f513b8c9b1a7bb92fb87550a 



Yeah, that's the commit that changed it.


How stable can we consider the rbd tool's output format?
This is something we want to run in a production environment. Using the
tool rather than the library makes things much simpler.


Generally it won't change much, but I don't think it should be
considered entirely unchanging. We'll add it to the release notes when
the output does change. We'll probably switch other commands to use
TextTable too, with the same results as with showmapped and lock list.
We could send a message to the mailing list when the output changes as
well, so you can prepare for a future release.


That would be great, and highly appreciated. Please drop us an email at
the following mailing lists, when the rbd tool's format changes:

synnefo-de...@googlegroups.com
ganeti-de...@googlegroups.com



Perhaps we should add a --format json|plain option so you don't have to
rely on particular formatting, you just parse the json. This would
match existing usage by many 'ceph ...' commands, and be easier
for scripts to use in general.


That would be even better! That would be the best approach for us, since
we use it inside Python code. Parsing JSON is very simple, and we will be
able to maintain compatibility even when the format changes.
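With such an option, a consumer could stop scraping columns entirely. A sketch of what the parsing side might look like, assuming a hypothetical `rbd showmapped --format json` that emits a list of mappings (the flag and the field names are assumptions based on the proposal in this thread, not a documented interface):

```python
import json

def parse_showmapped(output):
    """Parse JSON output from a hypothetical 'rbd showmapped --format json'.
    The field names ('device', 'pool', 'name') are illustrative; the real
    schema would come from the rbd tool itself."""
    return {m["device"]: (m["pool"], m["name"]) for m in json.loads(output)}
```

Whitespace-vs-tab separator changes become irrelevant once the output is structured.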

Thanks,
Constantinos



Re: RBD fio Performance concerns

2012-11-19 Thread Sébastien Han
Hello Mark,

First of all, thank you again for another accurate answer :-).

 I would have expected write aggregation and cylinder affinity to
 have eliminated some seeks and improved rotational latency resulting
 in better than theoretical random write throughput.  Against those
 expectations 763/850 IOPS is not so impressive.  But, it looks to
 me like you were running fio in a 1G file with 100 parallel requests.
 The default RBD stripe width is 4M.  This means that those 100
 parallel requests were being spread across 256 (1G/4M) objects.
 People in the know tell me that writes to a single object are
 serialized, which means that many of those (potentially) parallel
 writes were to the same object, and hence serialized.  This would
 increase the average request time for the colliding operations,
 and reduce the aggregate throughput correspondingly.  Use a
 bigger file (or a narrower stripe) and this will get better.


I followed your advice and used a bigger file (10G) and an iodepth of
128, and I've been able to reach ~27k IOPS for random reads, but I
couldn't reach more than 870 IOPS on random writes... That's kind of
expected. But the thing I still don't understand is: why are the
sequential reads/writes lower than the random ones? Or do I just need
to look at the bandwidth for those values?
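On the second part of that question: at a fixed block size, bandwidth and IOPS are proportional, so they are two views of the same number (a quick sanity check, using figures from the fio run later in this thread):

```python
def bandwidth_kib_s(iops, block_size_kib):
    """Bandwidth is just IOPS times block size: ~27k random 4K reads/s
    is ~106 MiB/s, while 4M requests need only ~27 IOPS to move the
    same amount of data."""
    return iops * block_size_kib

# 27203 random 4K reads/s -> 27203 * 4 = 108812 KiB/s,
# matching the ~108814 KB/s figure fio reports for that group.
```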

Thank you.

Regards.
--
Bien cordialement.
Sébastien HAN.


On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe mark.ka...@inktank.com wrote:
 On 11/15/2012 12:23 PM, Sébastien Han wrote:

 First of all, I would like to thank you for this well explained,
 structured and clear answer. I guess I got better IOPS thanks to the 10K
 disks.


 10K RPM would bring your per-drive throughput (for 4K random writes)
 up to 142 IOPS and your aggregate cluster throughput up to 1700.
 This would predict a corresponding RADOSbench throughput somewhere
 above 425 (how much better depending on write aggregation and cylinder
 affinity).  Your RADOSbench 708 now seems even more reasonable.

 To be really honest, I wasn't so concerned about the RADOS benchmarks
 but more about the RBD fio benchmarks and the amount of IOPS that comes
 out of them, which I found a bit too low.


 Sticking with 4K random writes, it looks to me like you were running
 fio with libaio (which means direct, no buffer cache).  Because it
 is direct, every I/O operation is really happening and the best
 sustained throughput you should expect from this cluster is
 the aggregate raw fio 4K write throughput (1700 IOPS) divided
 by two copies = 850 random 4K writes per second.  If I read the
 output correctly you got 763 or about 90% of back-of-envelope.

 BUT, there are some footnotes (there always are with performance)

 If you had been doing buffered I/O you would have seen a lot more
 (up front) benefit from page caching ... but you wouldn't have been
 measuring real (and hence sustainable) I/O throughput ... which is
 ultimately limited by the heads on those twelve disk drives, where
 all of those writes ultimately wind up.  It is easy to be fast
 if you aren't really doing the writes :-)

 I would have expected write aggregation and cylinder affinity to
 have eliminated some seeks and improved rotational latency resulting
 in better than theoretical random write throughput.  Against those
 expectations 763/850 IOPS is not so impressive.  But, it looks to
 me like you were running fio in a 1G file with 100 parallel requests.
 The default RBD stripe width is 4M.  This means that those 100
 parallel requests were being spread across 256 (1G/4M) objects.
 People in the know tell me that writes to a single object are
 serialized, which means that many of those (potentially) parallel
 writes were to the same object, and hence serialized.  This would
 increase the average request time for the colliding operations,
 and reduce the aggregate throughput correspondingly.  Use a
 bigger file (or a narrower stripe) and this will get better.

 Thus, getting 763 random 4K write IOPs out of those 12 drives
 still sounds about right to me.
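Mark's back-of-envelope can be reproduced numerically (a sketch of the arithmetic from the quoted analysis, not a benchmark):

```python
def estimated_cluster_write_iops(seek_ms, half_rotation_ms, drives, copies):
    """Back-of-envelope from the thread: per-drive direct 4K random
    write latency =~ seek + half a rotation (the write itself is ~30us,
    negligible); aggregate raw IOPS scales with drive count and is
    divided by the replication factor for client-visible writes."""
    per_drive_iops = 1000.0 / (seek_ms + half_rotation_ms)
    return per_drive_iops * drives / copies

# 7200 RPM case: 4 ms seek + 4 ms latency, 12 drives, 2 copies -> 750.
# 10K RPM case: half rotation ~3 ms -> ~857 client IOPS, i.e. roughly
# the ~850 figure against which the measured 763 is ~90%.
```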


 On 15 nov. 2012, at 19:43, Mark Kampe mark.ka...@inktank.com wrote:

 Dear Sebastien,

 Ross Turn forwarded me your e-mail.  You sent a great deal
 of information, but it was not immediately obvious to me
 what your specific concern was.

 You have 4 servers, 3 OSDs per, 2 copy, and you measured a
 radosbench (4K object creation) throughput of 2.9MB/s
 (or 708 IOPS).  I infer that you were disappointed by
 this number, but it looks right to me.

 Assuming typical 7200 RPM drives, I would guess that each
 of them would deliver a sustained direct 4K random write
 performance in the general neighborhood of:
 4ms seek (short seeks with write-settle-downs)
 4ms latency (1/2 rotation)
 0ms write (4K/144MB/s ~ 30us)
 -
 8ms or about 125 IOPS

 Your twelve drives should therefore have a sustainable
 aggregate direct 4K random write throughput of 1500 IOPS.

 Each 4K object create involves four writes (two copies,
 each getting 

Re: RBD fio Performance concerns

2012-11-19 Thread Sébastien Han
 If I remember, you use fio with a 4MB block size for sequential.
 So it's normal that you have fewer IOs, but more bandwidth.

That's correct for some of the benchmarks. However, even with 4K for
seq, I still get fewer IOPS. See my latest fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99  iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
 lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
bw (KB/s) : min=0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu  : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
 issued r/w/d: total=200473/0/0, short=0/0/0

 lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
  read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
 lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
  cpu  : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
 issued r/w/d: total=1632349/0/0, short=0/0/0

 lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
  write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
 lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
bw (KB/s) : min=7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
  cpu  : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
 issued r/w/d: total=0/11171/0, short=0/0/0

 lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
 lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
  write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
 lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
bw (KB/s) : min=0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
  cpu  : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
 issued r/w/d: total=0/52147/0, short=0/0/0

 lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
 lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
   READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
mint=60053msec, maxt=60053msec

Run status group 1 (all jobs):
   READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
maxb=111425KB/s, mint=60005msec, maxt=60005msec

Run status group 2 (all jobs):
  WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
mint=60725msec, maxt=60725msec

Run status group 3 (all jobs):
  WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
mint=60822msec, maxt=60822msec

Disk stats (read/write):
  rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
in_queue=33434120, util=99.79%

Cheers!
--
Bien cordialement.
Sébastien HAN.


On Mon, Nov 19, 2012 at 4:28 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
why the
sequential read/writes are lower than the randoms onces? Or maybe do I
just need to care about the bandwidth for those values?

 If I remember, you use fio with 4MB block size for sequential.
 So it's normal that you have less ios, but more 

Re: RBD fio Performance concerns

2012-11-19 Thread Sage Weil
On Mon, 19 Nov 2012, Sébastien Han wrote:
  If I remember, you use fio with 4MB block size for sequential.
  So it's normal that you have less ios, but more bandwith.
 
 That's correct for some of the benchmarks. However even with 4K for
 seq, I still get less IOPS. See below my last fio:

Small IOs striped over large objects tend to mean that many IOs may get 
piled up behind a single object at a time.  There is a new striping 
feature in RBD that lets you stripe small blocks over larger objects to 
mitigate this, but it means slower performance the rest of the time, and 
it is only really useful for specific workloads (e.g., a database journal 
file/device).
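For reference, the pile-up comes from how an image offset maps to a RADOS object under default striping: any two small writes inside the same 4M window contend on one object. A simplified sketch of that mapping (illustrative only; real librbd also prefixes object names and supports the stripe_unit/stripe_count parameters mentioned above):

```python
def rbd_object_for_offset(offset, object_size=4 << 20):
    """With default striping, byte 'offset' of an RBD image lives in
    object number offset // object_size. Writes landing in the same
    object are serialized, so a deep queue of sequential 4K writes all
    targets one object at a time."""
    return offset // object_size

# 256 queued sequential 4K writes starting at offset 0 all hit object 0,
# while random 4K writes over a 10G file spread across ~2560 objects.
```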

sage

 
 [quoted fio output trimmed; identical to the previous message]

Re: RBD fio Performance concerns

2012-11-19 Thread Mark Kampe

Recall:
   1. RBD volumes are striped (4M wide) across RADOS objects
   2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object.  All of
your writes are waiting through a very long line, which is adding
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize).  Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write aggregation.
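The latency cost of that serialization can be put in rough numbers (an illustrative model, not a measurement):

```python
def serialized_latency_ms(queue_depth, service_time_ms):
    """If 'queue_depth' direct writes all target one object and are
    serviced strictly one at a time, a newly issued write waits behind
    the whole queue: latency =~ depth * per-write service time. At
    depth 256 and an assumed ~5 ms per replicated 4K write, that is
    over a second, in the same ballpark as the ~1.4 s average
    completion latency fio reported for the seq-write group."""
    return queue_depth * service_time_ms
```

A 4M block size avoids this because each request owns a whole object, and buffered I/O avoids it because the writes are aggregated before they reach RADOS.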


That's correct for some of the benchmarks. However even with 4K for
seq, I still get less IOPS. See below my last fio:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99  iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
   read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
 slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
 clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
  lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
 bw (KB/s) : min=0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
   cpu  : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, =64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.1%
  issued r/w/d: total=200473/0/0, short=0/0/0

  lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
   read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
 slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
 clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
  lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
 bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, 
stdev=648.62
   cpu  : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, =64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.1%
  issued r/w/d: total=1632349/0/0, short=0/0/0

  lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
   write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
 slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
 clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
  lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
 bw (KB/s) : min=7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
   cpu  : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
    IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=0/11171/0, short=0/0/0

  lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
  lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
   write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
 slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
 clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
  lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
 bw (KB/s) : min=0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
   cpu  : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
    IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=0/52147/0, short=0/0/0

  lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
  lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
mint=60053msec, maxt=60053msec

Run status group 1 (all jobs):
READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,

Re: Many dns domain names in radosgw

2012-11-19 Thread Yehuda Sadeh
On Sat, Nov 17, 2012 at 1:50 PM, Sławomir Skowron szi...@gmail.com wrote:
 Welcome,

 I have a question. Is there any way to support multiple domain names
 in one radosgw on virtual-host-type connections in S3?

Are you aiming at having multiple virtual domain names pointing at the
same bucket?

Currently a gateway can only be set up with a single domain, so the
virtual bucket scheme will only translate subdomains of that domain as
buckets. Starting at 0.55 there will be a way to point alternative
domains to a specific bucket (by modifying their DNS CNAME record);
however, it doesn't sound like that's what you're looking for.

Yehuda
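
The virtual bucket scheme Yehuda describes can be sketched roughly as follows. This is an illustration only, not radosgw's actual code: the function name `resolve_bucket` and its exact behavior are assumptions, but they capture the idea that a gateway configured with one domain treats any subdomain of it as a bucket name, and a foreign domain needs a CNAME pointing at a specific bucket.

```python
# Illustrative sketch (not radosgw source): resolve an S3 virtual-host
# bucket name from the HTTP Host header, given the gateway's one domain.
def resolve_bucket(host, gateway_domain):
    """Return the bucket implied by the Host header, or None if the
    host is not a subdomain of the gateway's configured domain."""
    host = host.lower().rstrip(".")
    suffix = "." + gateway_domain.lower()
    if host.endswith(suffix):
        return host[: -len(suffix)]   # "b.x.com" -> "b"
    return None  # foreign domain: needs a CNAME mapped to a bucket

print(resolve_bucket("b.x.com", "x.com"))  # -> b
print(resolve_bucket("c.y.com", "x.com"))  # -> None (second domain unsupported)
```

This is why, pre-0.55, only subdomains of the single configured domain resolve to buckets.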
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-19 Thread Sébastien Han
@Sage, thanks for the info :)
@Mark:

 If you want to do sequential I/O, you should do it buffered
 (so that the writes can be aggregated) or with a 4M block size
 (very efficient and avoiding object serialization).

The original benchmark was performed with a 4M block size, and as
you can see I still get more IOPS with rand than with seq... I just tried
4M without direct I/O, still the same. I can print the fio results if
needed.

 We do direct writes for benchmarking, not because it is a reasonable
 way to do I/O, but because it bypasses the buffer cache and enables
 us to directly measure cluster I/O throughput (which is what we are
 trying to optimize).  Applications should usually do buffered I/O,
 to get the (very significant) benefits of caching and write aggregation.

I know why I use direct I/O. These are synthetic benchmarks, far removed
from real-life scenarios and how common applications work. I just
try to see the maximum I/O throughput that I can get from my RBD. All
my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS with seq?

Thanks to all of you..


On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com wrote:
 Recall:
1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

 Your sequential 4K writes are direct, depth=256, so there are
 (at all times) 256 writes queued to the same object.  All of
 your writes are waiting through a very long line, which is adding
 horrendous latency.

 If you want to do sequential I/O, you should do it buffered
 (so that the writes can be aggregated) or with a 4M block size
 (very efficient and avoiding object serialization).

 We do direct writes for benchmarking, not because it is a reasonable
 way to do I/O, but because it bypasses the buffer cache and enables
 us to directly measure cluster I/O throughput (which is what we are
 trying to optimize).  Applications should usually do buffered I/O,
 to get the (very significant) benefits of caching and write aggregation.
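
The two facts Mark recalls explain the seq-vs-rand gap numerically. A minimal sketch (the object-index arithmetic is illustrative, not librbd's actual layout code):

```python
# Sketch of 4 MiB striping plus per-object write serialization.
STRIPE = 4 * 1024 * 1024          # RBD stripe width: one RADOS object per 4 MiB

def object_index(offset):
    """RADOS object index a given image offset lands in."""
    return offset // STRIPE

# 256 in-flight sequential 4K writes starting at offset 0:
seq_objects = {object_index(i * 4096) for i in range(256)}

# 256 4K writes evenly spread over a 1 GiB image (a stand-in for random I/O):
spread_objects = {object_index(off) for off in range(0, 1 << 30, (1 << 30) // 256)}

print(len(seq_objects))     # -> 1: every write queues behind the others
print(len(spread_objects))  # -> 256: writes can proceed in parallel
```

With direct 4K sequential I/O at depth 256, all 256 requests land on one object and serialize, which is exactly the "very long line" described above; the random workload spreads the same depth across many objects.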


 That's correct for some of the benchmarks. However even with 4K for
 seq, I still get less IOPS. See below my last fio:

 # fio rbd-bench.fio
 seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
 rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
 iodepth=256
 seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
 rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
 iodepth=256
 fio 1.59
 Starting 4 processes
 Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99  iops] [eta
 02m:59s]
 seq-read: (groupid=0, jobs=1): err= 0: pid=15096
read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
  slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
  clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
   lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
  bw (KB/s) : min=0, max=14406, per=31.89%, avg=4258.24,
 stdev=6239.06
cpu  : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
  >=64=100.0%
    submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
  >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
  >=64=0.1%
   issued r/w/d: total=200473/0/0, short=0/0/0

   lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
 rand-read: (groupid=1, jobs=1): err= 0: pid=16846
read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
  slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
  clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
   lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
  bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
 stdev=648.62
cpu  : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
  >=64=100.0%
    submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
  >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
  >=64=0.1%
   issued r/w/d: total=1632349/0/0, short=0/0/0

   lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
 seq-write: (groupid=2, jobs=1): err= 0: pid=18653
write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
  slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
  clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
   lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
  bw (KB/s) : min=7, max= 2165, per=104.03%, avg=764.65,
 stdev=353.97
cpu  : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%,
  >=64=99.4%
    submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
  >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
  >=64=0.1%
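
As a back-of-the-envelope cross-check (plain arithmetic, not part of fio's output), Little's law ties the seq-write latency above to the queue depth: with the queue kept full, mean latency is roughly depth divided by IOPS.

```python
# Little's law sanity check on the seq-write run above: with a full queue,
# mean latency ~= queue depth / IOPS. Numbers taken from the fio output.
iodepth = 256        # fio job iodepth
iops = 183           # reported seq-write IOPS
predicted_lat = iodepth / iops
print(round(predicted_lat, 2))  # -> 1.4 (seconds)
```

That predicted ~1.40 s matches the reported seq-write lat avg of ~1.39 s, confirming the latency is queueing delay from serialization, not per-write service time.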

Re: Many dns domain names in radosgw

2012-11-19 Thread Sławomir Skowron
Yes. I am looking at using domains x.com and y.com with virtual host
buckets like b.x.com, c.y.com.

But if that's not possible I can handle this with a CNAME *.x.com and use
only b and c on the x.com domain.

Thanks for response.

19 lis 2012 19:02, Yehuda Sadeh yeh...@inktank.com napisał(a):

 On Sat, Nov 17, 2012 at 1:50 PM, Sławomir Skowron szi...@gmail.com wrote:
  Welcome,
 
  I have a question. Is there any way to support multiple domain names
  in one radosgw on virtual-host-type connections in S3?
 
 Are you aiming at having multiple virtual domain names pointing at the
 same bucket?

 Currently a gateway can only be set up with a single domain, so the
 virtual bucket scheme will only translate subdomains of that domain as
 buckets. Starting at 0.55 there will be a way to point alternative
 domains to a specific bucket (by modifying their DNS CNAME record);
 however, it doesn't sound like that's what you're looking for.

 Yehuda


Remote Ceph Install

2012-11-19 Thread Blackwell, Edward
Hi,
I work for Harris Corporation, and we are investigating Ceph as a potential 
solution to a storage problem that one of our government customers is currently 
having.  I've already created a two-node cluster on a couple of VMs with 
another VM acting as an administrative client.  The cluster was created using 
some installation instructions supplied to us via Inktank, and through the use 
of the ceph-deploy script.  Aside from a couple of quirky discrepancies between 
the installation instructions and my environment, everything went well.  My 
issue has cropped up on the second cluster I'm trying to create, which is using 
a VM and a non-VM server for the nodes in the cluster.  Eventually, both nodes 
in this cluster will be non-VMs, but we're still waiting on the hardware for 
the second node, so I'm using a VM in the meantime just to get this second 
cluster up and going.  Of course, the administrative client node is still a VM.

The problem that I'm having with this second cluster concerns the non-VM server 
(elsceph01 for the sake of the commands mentioned from here on out).  In 
particular, the issue crops up with the ceph-deploy install elsceph01 command 
I'm executing on my client VM (cephclient01) to install Ceph on the non-VM 
server. The installation doesn't appear to be working as the command does not 
return the OK message that it should when it completes successfully.  I've 
tried using the verbose option on the command to see if that sheds any light on 
the subject, but alas, it does not:


root@cephclient01:~/my-admin-sandbox# ceph-deploy -v install elsceph01
DEBUG:ceph_deploy.install:Installing stable version argonaut on cluster ceph 
hosts elsceph01
DEBUG:ceph_deploy.install:Detecting platform for host elsceph01 ...
DEBUG:ceph_deploy.install:Installing for Ubuntu 12.04 on host elsceph01 ...
root@cephclient01:~/my-admin-sandbox#


Would you happen to have a breakdown of the commands being executed by the 
ceph-deploy script behind the scenes so I can maybe execute them one-by-one to 
see where the error is?  I have confirmed that it looks like the installation 
of the software has succeeded as I did a 'which ceph' command on elsceph01, and 
it reported back /usr/bin/ceph.  Also, /etc/ceph/ceph.conf is there, and it 
matches the file created by the ceph-deploy new ... command on the client.  
Does the install command do a mkcephfs behind the scenes?  The reason I ask is 
that when I do the ceph-deploy mon command from the client, which is the next 
command listed in the instructions to do, I get this output:


root@cephclient01:~/my-admin-sandbox# ceph-deploy mon
creating /var/lib/ceph/tmp/ceph-ELSCEPH01.mon.keyring
2012-11-15 11:35:38.954261 7f7a6c274780 -1 asok(0x260b000) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
bind the UNIX domain socket to '/var/run/ceph/ceph-mon.ELSCEPH01.asok': (2) No 
such file or directory
Traceback (most recent call last):
  File /usr/local/bin/ceph-deploy, line 9, in module
load_entry_point('ceph-deploy==0.0.1', 'console_scripts', 'ceph-deploy')()
  File /root/ceph-deploy/ceph_deploy/cli.py, line 80, in main
added entity mon. auth auth(auid = 18446744073709551615 
key=AQBWDj5QAP6LHhAAskVBnUkYHJ7eYREmKo5qKA== with 0 caps)
return args.func(args)
mon/MonMap.h: In function 'void MonMap::add(const string&, const 
entity_addr_t&)' thread 7f7a6c274780 time 2012-11-15 11:35:38.955024
mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0)
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8]
2: (MonMap::build_initial(CephContext*, std::ostream&)+0x113) [0x59bd53]
3: (main()+0x12bb) [0x45ffab]
4: (__libc_start_main()+0xed) [0x7f7a6a6d776d]
5: ceph-mon() [0x462a19]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
2012-11-15 11:35:38.955924 7f7a6c274780 -1 mon/MonMap.h: In function 'void 
MonMap::add(const string&, const entity_addr_t&)' thread 7f7a6c274780 time 
2012-11-15 11:35:38.955024
mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0)

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8]
2: (MonMap::build_initial(CephContext*, std::ostream&)+0x113) [0x59bd53]
3: (main()+0x12bb) [0x45ffab]
4: (__libc_start_main()+0xed) [0x7f7a6a6d776d]
5: ceph-mon() [0x462a19]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

-1 2012-11-15 11:35:38.954261 7f7a6c274780 -1 asok(0x260b000) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
bind the UNIX domain socket to '/var/run/ceph/ceph-mon.ELSCEPH01.asok': (2) No 
such file or directory
 0 2012-11-15 11:35:38.955924 7f7a6c274780 -1 mon/MonMap.h: In function 
'void MonMap::add(const string&, const entity_addr_t&)' thread 7f7a6c274780 
time 2012-11-15 11:35:38.955024

Re: [PATCH] rbd: get rid of rbd_{get,put}_dev()

2012-11-19 Thread Dan Mick

Reviewed-by: Dan Mick dan.m...@inktank.com

On 11/16/2012 07:43 AM, Alex Elder wrote:

The functions rbd_get_dev() and rbd_put_dev() are trivial wrappers
that add no value, and their existence suggests they may do more
than what they do.

Get rid of them.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   14 ++
  1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9d9a2f3..f4b5a64 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -337,16 +337,6 @@ void rbd_warn(struct rbd_device *rbd_dev, const
char *fmt, ...)
  #  define rbd_assert(expr)((void) 0)
  #endif /* !RBD_DEBUG */

-static struct device *rbd_get_dev(struct rbd_device *rbd_dev)
-{
-   return get_device(rbd_dev->dev);
-}
-
-static void rbd_put_dev(struct rbd_device *rbd_dev)
-{
-   put_device(rbd_dev->dev);
-}
-
  static int rbd_dev_refresh(struct rbd_device *rbd_dev, u64 *hver);
  static int rbd_dev_v2_refresh(struct rbd_device *rbd_dev, u64 *hver);

@@ -357,7 +347,7 @@ static int rbd_open(struct block_device *bdev,
fmode_t mode)
	if ((mode & FMODE_WRITE) && rbd_dev->mapping.read_only)
return -EROFS;

-   rbd_get_dev(rbd_dev);
+   (void) get_device(rbd_dev->dev);
	set_device_ro(bdev, rbd_dev->mapping.read_only);
	rbd_dev->open_count++;

@@ -370,7 +360,7 @@ static int rbd_release(struct gendisk *disk, fmode_t
mode)

	rbd_assert(rbd_dev->open_count > 0);
	rbd_dev->open_count--;
-   rbd_put_dev(rbd_dev);
+   put_device(rbd_dev->dev);

return 0;
  }




[PATCH] rbd block driver fix race between aio completion and aio cancel

2012-11-19 Thread Stefan Priebe

From: Stefan Priebe s.pri...@profhost.ag

This one fixes a race that qemu also had in the iscsi block driver between
cancellation and I/O completion.

qemu_rbd_aio_cancel was not synchronously waiting for the end of
the command.

It also removes the useless cancelled flag and instead introduces
a status flag with EINPROGRESS, like the iscsi block driver.

Signed-off-by: Stefan Priebe s.pri...@profihost.ag
---
 block/rbd.c |   19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 5a0f79f..7b3bcbb 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -76,7 +76,7 @@ typedef struct RBDAIOCB {
 int64_t sector_num;
 int error;
 struct BDRVRBDState *s;
-int cancelled;
+int status;
 } RBDAIOCB;
  typedef struct RADOSCB {
@@ -376,9 +376,7 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 RBDAIOCB *acb = rcb->acb;
 int64_t r;
 -if (acb->cancelled) {
-qemu_vfree(acb->bounce);
-qemu_aio_release(acb);
+if (acb->bh) {
 goto done;
 }
 @@ -406,9 +404,12 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 acb->ret = r;
 }
 }
+acb->status = acb->ret;
+
 /* Note that acb->bh can be NULL in case where the aio was 
cancelled */

 acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
 qemu_bh_schedule(acb->bh);
+
 done:
 g_free(rcb);
 }
@@ -573,7 +574,10 @@ static void qemu_rbd_close(BlockDriverState *bs)
 static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
 RBDAIOCB *acb = (RBDAIOCB *) blockacb;
-acb->cancelled = 1;
+
+while (acb->status == -EINPROGRESS) {
+qemu_aio_wait();
+}
 }
  static AIOPool rbd_aio_pool = {
@@ -642,10 +646,11 @@ static void rbd_aio_bh_cb(void *opaque)
 qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
 }
 qemu_vfree(acb->bounce);
-acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 qemu_bh_delete(acb->bh);
 acb->bh = NULL;
 +acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
+
 qemu_aio_release(acb);
 }
 @@ -689,8 +694,8 @@ static BlockDriverAIOCB 
*rbd_start_aio(BlockDriverState *bs,

 acb->ret = 0;
 acb->error = 0;
 acb->s = s;
-acb->cancelled = 0;
 acb->bh = NULL;
+acb->status = -EINPROGRESS;
  if (cmd == RBD_AIO_WRITE) {
 qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
--
1.7.10.4
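
The core idea of the patch, waiting synchronously in cancel until the in-flight request reaches a final status instead of flagging it cancelled and freeing state underneath it, can be sketched outside qemu with plain threads. All names here (`AioRequest`, `complete`, `cancel`) are illustrative, not qemu's API:

```python
# Sketch of the synchronous-cancel pattern: cancel() blocks until the
# request reports a final status (the analogue of the EINPROGRESS loop
# around qemu_aio_wait() in the patch).
import threading

EINPROGRESS = object()  # sentinel for "still in flight"

class AioRequest:
    def __init__(self):
        self.status = EINPROGRESS
        self._done = threading.Event()

    def complete(self, ret):
        self.status = ret       # final status, like acb->status = acb->ret
        self._done.set()

    def cancel(self):
        # Spin-wait for completion instead of releasing state early.
        while self.status is EINPROGRESS:
            self._done.wait(0.01)

req = AioRequest()
threading.Timer(0.05, req.complete, args=(0,)).start()  # completion arrives later
req.cancel()        # returns only after the completion has run
print(req.status)   # -> 0
```

The original bug was exactly the early release: the completion callback could fire after cancel had already freed the request's bounce buffer.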



[no subject]

2012-11-19 Thread Stefan Priebe
From Stefan Priebe s.pri...@profihost.ag # This line is ignored.
From: Stefan Priebe s.pri...@profihost.ag
Cc: pve-de...@pve.proxmox.com
Cc: pbonz...@redhat.com
Cc: ceph-devel@vger.kernel.org
Subject: QEMU/PATCH: rbd block driver: fix race between completion and cancel
In-Reply-To:


ve-de...@pve.proxmox.com
pbonz...@redhat.com
ceph-devel@vger.kernel.org


Re: RBD fio Performance concerns

2012-11-19 Thread Sébastien Han
Which iodepth did you use for those benchmarks?


 I really don't understand why I can't get more rand read iops with 4K block 
 ...

Me neither, hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Bien cordialement.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
@Alexandre: is it the same for you? or do you always get more IOPS with seq?

 rand read 4K : 6000 iops
 seq read 4K : 3500 iops
 seq read 4M : 31iops (1gigabit client bandwith limit)

 rand write 4k: 6000iops  (tmpfs journal)
 seq write 4k: 1600iops
 seq write 4M : 31iops (1gigabit client bandwith limit)


 I really don't understand why I can't get more rand read iops with 4K block 
 ...

 I tried with a high-end CPU for the client; it doesn't change anything.
 But the test cluster uses old 8-core E5420 @ 2.50GHz CPUs (though CPU usage is around 15% on 
 the cluster during the read bench).


 - Mail original -

 De: Sébastien Han han.sebast...@gmail.com
 À: Mark Kampe mark.ka...@inktank.com
 Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel 
 ceph-devel@vger.kernel.org
 Envoyé: Lundi 19 Novembre 2012 19:03:40
 Objet: Re: RBD fio Performance concerns

 @Sage, thanks for the info :)
 @Mark:

 If you want to do sequential I/O, you should do it buffered
 (so that the writes can be aggregated) or with a 4M block size
 (very efficient and avoiding object serialization).

 The original benchmark has been performed with 4M block size. And as
 you can see I still get more IOPS with rand than seq... I just tried
 with 4M without direct I/O, still the same. I can print fio results if
 it's needed.

 We do direct writes for benchmarking, not because it is a reasonable
 way to do I/O, but because it bypasses the buffer cache and enables
 us to directly measure cluster I/O throughput (which is what we are
 trying to optimize). Applications should usually do buffered I/O,
 to get the (very significant) benefits of caching and write aggregation.

 I know why I use direct I/O. It's synthetic benchmarks, it's far away
 from a real life scenario and how common applications works. I just
 try to see the maximum I/O throughput that I can get from my RBD. All
 my applications use buffered I/O.

 @Alexandre: is it the same for you? or do you always get more IOPS with seq?

 Thanks to all of you..


 On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com wrote:
 Recall:
 1. RBD volumes are striped (4M wide) across RADOS objects
 2. distinct writes to a single RADOS object are serialized

 Your sequential 4K writes are direct, depth=256, so there are
 (at all times) 256 writes queued to the same object. All of
 your writes are waiting through a very long line, which is adding
 horrendous latency.

 If you want to do sequential I/O, you should do it buffered
 (so that the writes can be aggregated) or with a 4M block size
 (very efficient and avoiding object serialization).

 We do direct writes for benchmarking, not because it is a reasonable
 way to do I/O, but because it bypasses the buffer cache and enables
 us to directly measure cluster I/O throughput (which is what we are
 trying to optimize). Applications should usually do buffered I/O,
 to get the (very significant) benefits of caching and write aggregation.


 That's correct for some of the benchmarks. However even with 4K for
 seq, I still get less IOPS. See below my last fio:

 # fio rbd-bench.fio
 seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
 rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
 iodepth=256
 seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
 rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
 iodepth=256
 fio 1.59
 Starting 4 processes
 Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
 02m:59s]
 seq-read: (groupid=0, jobs=1): err= 0: pid=15096
 read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
 slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
 clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
 lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
 bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
 stdev=6239.06
 cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
 >=64=100.0%
 submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >=64=0.0%
 complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >=64=0.1%
 issued r/w/d: total=200473/0/0, short=0/0/0

 lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
 rand-read: (groupid=1, jobs=1): err= 0: pid=16846
 read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
 slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
 clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
 lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
 bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
 stdev=648.62
 cpu : usr=8.26%, sys=49.11%, 

Re: RBD fio Performance concerns

2012-11-19 Thread Sébastien Han
Hello Mark,

See below my benchmarks results:

-RADOS Bench with 4M block size write:

# rados -p bench bench 300 write -t 32 --no-cleanup
Maintaining 32 concurrent writes of 4194304 bytes for at least 300 seconds.

2012-11-19 21:35:01.722143min lat: 0.255396 max lat: 8.40212 avg lat: 1.14076
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   300  32  8414  8382   111.737   104  0.502774   1.14076
 Total time run: 300.814954
Total writes made:  8414
Write size: 4194304
Bandwidth (MB/sec): 111.883

Stddev Bandwidth:   7.4274
Max bandwidth (MB/sec): 132
Min bandwidth (MB/sec): 56
Average Latency:1.14352
Stddev Latency: 1.18344
Max latency:8.40212
Min latency:0.255396



-RADOS Bench with 4M block size seq:

# rados -p bench bench 300 seq -t 32 --no-cleanup

2012-11-19 21:40:35.128728min lat: 0.224415 max lat: 6.14781 avg lat: 1.1591
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   300  31  8284  8253   110.021   108   1.876981.1591
 Total time run:300.931287
Total reads made: 8285
Read size:4194304
Bandwidth (MB/sec):110.125

Average Latency:   1.16177
Max latency:   6.14781
Min latency:   0.224415
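
As a quick cross-check of the rados bench figures above (plain arithmetic, not part of the tool's output), the reported bandwidth follows directly from the totals: writes times the 4 MiB write size, divided by elapsed time.

```python
# Reproduce "Bandwidth (MB/sec): 111.883" from the write run's totals.
writes = 8414            # "Total writes made"
write_mib = 4            # "Write size: 4194304" bytes = 4 MiB
secs = 300.814954        # "Total time run"
mb_per_s = writes * write_mib / secs
print(round(mb_per_s, 3))  # -> 111.883
```

The seq-read run checks out the same way: 8285 reads × 4 MiB / 300.93 s ≈ 110.1 MB/s.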


-RBD FIO test, as you recommend I used 4M block size for seq tests for
the first test. See below the fio configuration file used:

[global]
ioengine=libaio
iodepth=4
size=1G
runtime=60
filename=/dev/rbd1

[seq-read]
rw=read
bs=4M
stonewall
direct=1

[rand-read]
rw=randread
bs=4K
stonewall
direct=1

[seq-write]
rw=write
bs=4M
stonewall
direct=1

[rand-write]
rw=randwrite
bs=4K
stonewall
direct=1


Results iodepth 4 and 1G file:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=4
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
seq-write: (g=2): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=4
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [64.2% done] [0K/2588K /s] [0 /632  iops] [eta 01m:18s]
seq-read: (groupid=0, jobs=1): err= 0: pid=10586
  read : io=1024.0MB, bw=110656KB/s, iops=27 , runt=  9476msec
slat (usec): min=250 , max=1812 , avg=389.88, stdev=178.26
clat (msec): min=37 , max=615 , avg=147.42, stdev=102.77
 lat (msec): min=38 , max=615 , avg=147.81, stdev=102.77
bw (KB/s) : min=84216, max=122390, per=99.60%, avg=110208.50, stdev=9149.98
  cpu  : usr=0.00%, sys=0.97%, ctx=1552, majf=0, minf=4119
  IO depths: 1=0.4%, 2=0.8%, 4=98.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued r/w/d: total=256/0/0, short=0/0/0

 lat (msec): 50=4.69%, 100=31.64%, 250=50.78%, 500=11.72%, 750=1.17%
rand-read: (groupid=1, jobs=1): err= 0: pid=10868
  read : io=161972KB, bw=2697.1KB/s, iops=674 , runt= 60036msec
slat (usec): min=12 , max=346 , avg=39.89, stdev=10.04
clat (usec): min=570 , max=50215 , avg=5885.64, stdev=12119.46
 lat (usec): min=601 , max=50258 , avg=5926.07, stdev=12117.44
bw (KB/s) : min= 2015, max= 3356, per=100.15%, avg=2701.03, stdev=276.41
  cpu  : usr=0.51%, sys=2.14%, ctx=66054, majf=0, minf=26
  IO depths: 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued r/w/d: total=40493/0/0, short=0/0/0
 lat (usec): 750=3.69%, 1000=60.21%
 lat (msec): 2=19.37%, 4=1.49%, 10=1.30%, 20=0.30%, 50=13.64%
 lat (msec): 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=12619
  write: io=1024.0MB, bw=112412KB/s, iops=27 , runt=  9328msec
slat (usec): min=510 , max=1683 , avg=820.63, stdev=150.32
clat (msec): min=47 , max=744 , avg=144.21, stdev=73.99
 lat (msec): min=48 , max=744 , avg=145.03, stdev=74.00
bw (KB/s) : min=103193, max=124830, per=100.87%, avg=113390.71,
stdev=6178.93
  cpu  : usr=1.46%, sys=0.81%, ctx=267, majf=0, minf=21
  IO depths: 1=0.4%, 2=0.8%, 4=98.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued r/w/d: total=0/256/0, short=0/0/0

 lat (msec): 50=0.78%, 100=17.97%, 250=75.39%, 500=5.08%, 750=0.78%
rand-write: (groupid=3, jobs=1): err= 0: pid=12934
  write: io=125352KB, bw=2088.1KB/s, iops=522 , runt= 60007msec
slat (usec): min=13 , max=388 , avg=50.47, stdev=13.73
clat (msec): min=1 , max=1271 , avg= 7.60, stdev=22.16
 lat (msec): min=1 , max=1271 , avg= 7.66, stdev=22.16
bw (KB/s) : min=  155, max= 2944, per=102.13%, avg=2132.45, 

Can't start ceph mon

2012-11-19 Thread Dave Humphreys (Datatone)

I have a problem in which I can't start my ceph monitor. The log is shown below.

The log shows version 0.54. I was running 0.52 when the problem arose, and I 
moved to the latest in case the newer version fixed the problem.

The original failure happened a week or so ago, and could have been as a result 
of running out of disk space when the ceph monitor log became huge.

What should I do to recover the situation?


David





2012-11-19 20:38:51.598468 7fc13fdc6780  0 ceph version 0.54 
(commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 21012
2012-11-19 20:38:51.598482 7fc13fdc6780  1 store(/ceph/mon.vault01) mount
2012-11-19 20:38:51.598527 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 21
2012-11-19 20:38:51.598542 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
magic = 21 bytes
2012-11-19 20:38:51.598562 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 75
2012-11-19 20:38:51.598567 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
feature_set = 75 bytes
2012-11-19 20:38:51.598582 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 205
2012-11-19 20:38:51.598586 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
monmap/latest = 205 bytes
2012-11-19 20:38:51.598809 7fc13fdc6780  1 -- 10.0.1.1:6789/0 learned my addr 
10.0.1.1:6789/0
2012-11-19 20:38:51.598818 7fc13fdc6780  1 accepter.accepter.bind my_inst.addr 
is 10.0.1.1:6789/0 need_addr=0
2012-11-19 20:38:51.599498 7fc13fdc6780  1 -- 10.0.1.1:6789/0 messenger.start
2012-11-19 20:38:51.599508 7fc13fdc6780  1 accepter.accepter.start
2012-11-19 20:38:51.599610 7fc13fdc6780  1 mon.vault01@-1(probing) e1 init fsid 
4d7d8d20-338c-4bdc-9918-9bcf04f9a832
2012-11-19 20:38:51.599674 7fc13cdbe700  1 -- 10.0.1.1:6789/0 >> :/0 
pipe(0x213c6c0 sd=14 :6789 pgs=0 cs=0 l=0).accept sd=14
2012-11-19 20:38:51.599678 7fc141eff700  1 -- 10.0.1.1:6789/0 >> :/0 
pipe(0x213c240 sd=9 :6789 pgs=0 cs=0 l=0).accept sd=9
2012-11-19 20:38:51.599718 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 37
2012-11-19 20:38:51.599723 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
cluster_uuid = 37 bytes
2012-11-19 20:38:51.599718 7fc13ccbd700  1 -- 10.0.1.1:6789/0 >> :/0 
pipe(0x213c480 sd=19 :6789 pgs=0 cs=0 l=0).accept sd=19
2012-11-19 20:38:51.599729 7fc13fdc6780 10 mon.vault01@-1(probing) e1 
check_fsid cluster_uuid contains '4d7d8d20-338c-4bdc-9918-9bcf04f9a832'
2012-11-19 20:38:51.599739 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 75
2012-11-19 20:38:51.599745 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
feature_set = 75 bytes
2012-11-19 20:38:51.599751 7fc13fdc6780 10 mon.vault01@-1(probing) e1 features 
compat={},rocompat={},incompat={1=initial feature set (~v.18)}
2012-11-19 20:38:51.599759 7fc13fdc6780 15 store(/ceph/mon.vault01) exists_bl 
joined
2012-11-19 20:38:51.599769 7fc13fdc6780 10 mon.vault01@-1(probing) e1 
has_ever_joined = 1
2012-11-19 20:38:51.599794 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int 
pgmap/last_committed = 13
2012-11-19 20:38:51.599801 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int 
pgmap/first_committed = 132833
2012-11-19 20:38:51.599810 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 239840
2012-11-19 20:38:51.599928 7fc13cbbc700  1 -- 10.0.1.1:6789/0 >> :/0 
pipe(0x213cd80 sd=20 :6789 pgs=0 cs=0 l=0).accept sd=20
2012-11-19 20:38:51.599950 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
pgmap/latest = 239840 bytes
--- begin dump of recent events ---
2012-11-19 20:38:51.600509 7fc13fdc6780 -1 
*** Caught signal (Aborted) **
 in thread 7fc13fdc6780

 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
 1: ceph-mon() [0x53adf8]
 2: (()+0xfe90) [0x7fc141830e90]
 3: (gsignal()+0x3e) [0x7fc140016dae]
 4: (abort()+0x17b) [0x7fc14001825b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d]
 6: (()+0xb31b6) [0x7fc141af11b6]
 7: (()+0xb31e3) [0x7fc141af11e3]
 8: (()+0xb32de) [0x7fc141af12de]
 9: ceph-mon() [0x5ecb9f]
 10: (Paxos::get_stashed(ceph::buffer::list)+0x1ed) [0x49e28d]
 11: (Paxos::init()+0x109) [0x49e609]
 12: (Monitor::init()+0x36a) [0x485a4a]
 13: (main()+0x1289) [0x46d909]
 14: (__libc_start_main()+0xed) [0x7fc14000364d]
 15: ceph-mon() [0x46fa09]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

   -55 2012-11-19 20:38:51.596694 7fc13fdc6780  5 asok(0x213d000) 
register_command perfcounters_dump hook 0x2131050
   -55 2012-11-19 20:38:51.596720 7fc13fdc6780  5 asok(0x213d000) 
register_command 1 hook 0x2131050
   -54 2012-11-19 20:38:51.596725 7fc13fdc6780  5 asok(0x213d000) 
register_command perf dump hook 0x2131050
   -53 2012-11-19 20:38:51.596735 7fc13fdc6780  5 asok(0x213d000) 
register_command perfcounters_schema hook 0x2131050
   -52 2012-11-19 20:38:51.596740 7fc13fdc6780  5 asok(0x213d000) 
register_command 2 hook 0x2131050
   -51 2012-11-19 20:38:51.596745 7fc13fdc6780  5 asok(0x213d000) 
register_command perf schema hook 0x2131050
   -50 2012-11-19 

Cannot Start Ceph Mon

2012-11-19 Thread Dave Humphreys (Datatone)
(Apologies if this is seen to be a repeat posting: I think that the last 
attempt fell into the void).

I can't start my ceph monitor. The log is below.

Though this shows version 0.54, the problem arose whilst using 0.52. Something 
may have become corrupted when the disk space ran out due to an immense ceph 
mon log.

Is there anything I can do to recover the situation?

Regards,
David


bash-4.1# cat /var/log/ceph/mon.vault01.log 
2012-11-19 20:38:51.598468 7fc13fdc6780  0 ceph version 0.54 
(commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 21012
2012-11-19 20:38:51.598482 7fc13fdc6780  1 store(/ceph/mon.vault01) mount
2012-11-19 20:38:51.598527 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 21
2012-11-19 20:38:51.598542 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
magic = 21 bytes
2012-11-19 20:38:51.598562 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 75
2012-11-19 20:38:51.598567 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
feature_set = 75 bytes
2012-11-19 20:38:51.598582 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 205
2012-11-19 20:38:51.598586 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
monmap/latest = 205 bytes
2012-11-19 20:38:51.598809 7fc13fdc6780  1 -- 10.0.1.1:6789/0 learned my addr 
10.0.1.1:6789/0
2012-11-19 20:38:51.598818 7fc13fdc6780  1 accepter.accepter.bind my_inst.addr 
is 10.0.1.1:6789/0 need_addr=0
2012-11-19 20:38:51.599498 7fc13fdc6780  1 -- 10.0.1.1:6789/0 messenger.start
2012-11-19 20:38:51.599508 7fc13fdc6780  1 accepter.accepter.start
2012-11-19 20:38:51.599610 7fc13fdc6780  1 mon.vault01@-1(probing) e1 init fsid 
4d7d8d20-338c-4bdc-9918-9bcf04f9a832
2012-11-19 20:38:51.599674 7fc13cdbe700  1 -- 10.0.1.1:6789/0  :/0 
pipe(0x213c6c0 sd=14 :6789 pgs=0 cs=0 l=0).accept sd=14
2012-11-19 20:38:51.599678 7fc141eff700  1 -- 10.0.1.1:6789/0  :/0 
pipe(0x213c240 sd=9 :6789 pgs=0 cs=0 l=0).accept sd=9
2012-11-19 20:38:51.599718 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 37
2012-11-19 20:38:51.599723 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
cluster_uuid = 37 bytes
2012-11-19 20:38:51.599718 7fc13ccbd700  1 -- 10.0.1.1:6789/0  :/0 
pipe(0x213c480 sd=19 :6789 pgs=0 cs=0 l=0).accept sd=19
2012-11-19 20:38:51.599729 7fc13fdc6780 10 mon.vault01@-1(probing) e1 
check_fsid cluster_uuid contains '4d7d8d20-338c-4bdc-9918-9bcf04f9a832'
2012-11-19 20:38:51.599739 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 75
2012-11-19 20:38:51.599745 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
feature_set = 75 bytes
2012-11-19 20:38:51.599751 7fc13fdc6780 10 mon.vault01@-1(probing) e1 features 
compat={},rocompat={},incompat={1=initial feature set (~v.18)}
2012-11-19 20:38:51.599759 7fc13fdc6780 15 store(/ceph/mon.vault01) exists_bl 
joined
2012-11-19 20:38:51.599769 7fc13fdc6780 10 mon.vault01@-1(probing) e1 
has_ever_joined = 1
2012-11-19 20:38:51.599794 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int 
pgmap/last_committed = 13
2012-11-19 20:38:51.599801 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int 
pgmap/first_committed = 132833
2012-11-19 20:38:51.599810 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at 
off 0 of 239840
2012-11-19 20:38:51.599928 7fc13cbbc700  1 -- 10.0.1.1:6789/0  :/0 
pipe(0x213cd80 sd=20 :6789 pgs=0 cs=0 l=0).accept sd=20
2012-11-19 20:38:51.599950 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
pgmap/latest = 239840 bytes
--- begin dump of recent events ---
2012-11-19 20:38:51.600509 7fc13fdc6780 -1 *** Caught signal (Aborted) **
 in thread 7fc13fdc6780

 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
 1: ceph-mon() [0x53adf8]
 2: (()+0xfe90) [0x7fc141830e90]
 3: (gsignal()+0x3e) [0x7fc140016dae]
 4: (abort()+0x17b) [0x7fc14001825b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d]
 6: (()+0xb31b6) [0x7fc141af11b6]
 7: (()+0xb31e3) [0x7fc141af11e3]
 8: (()+0xb32de) [0x7fc141af12de]
 9: ceph-mon() [0x5ecb9f]
 10: (Paxos::get_stashed(ceph::buffer::list)+0x1ed) [0x49e28d]
 11: (Paxos::init()+0x109) [0x49e609]
 12: (Monitor::init()+0x36a) [0x485a4a]
 13: (main()+0x1289) [0x46d909]
 14: (__libc_start_main()+0xed) [0x7fc14000364d]
 15: ceph-mon() [0x46fa09]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

   -55 2012-11-19 20:38:51.596694 7fc13fdc6780  5 asok(0x213d000) 
register_command perfcounters_dump hook 0x2131050
   -55 2012-11-19 20:38:51.596720 7fc13fdc6780  5 asok(0x213d000) 
register_command 1 hook 0x2131050
   -54 2012-11-19 20:38:51.596725 7fc13fdc6780  5 asok(0x213d000) 
register_command perf dump hook 0x2131050
   -53 2012-11-19 20:38:51.596735 7fc13fdc6780  5 asok(0x213d000) 
register_command perfcounters_schema hook 0x2131050
   -52 2012-11-19 20:38:51.596740 7fc13fdc6780  5 asok(0x213d000) 
register_command 2 hook 0x2131050
   -51 2012-11-19 20:38:51.596745 7fc13fdc6780  5 asok(0x213d000) 
register_command perf schema hook 0x2131050
   -50 

librbd discard bug problems - i got it

2012-11-19 Thread Stefan Priebe

Hello Josh,

after digging around for three days, I've got it.

The problem is in aio_discard in internal.cc: the I/O fails when AioZero 
or AioTruncate is used.


It works fine with AioRemove, so it seems to depend on the overlap handling. 
Hopefully I'll be able to provide a patch tonight.
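For context on those three request types: librbd splits a discard per object and maps each piece onto one of those object operations, depending on how the byte range overlaps the object. A deliberately simplified sketch of that decision follows; the constants and helper name are ours, not the internal.cc code, and the real code also takes the computed overlap into account (which matches the observation above that it "seems to depend on overlapping").

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default 4 MB RBD object size

def discard_op(offset_in_object, length):
    """Pick the object-store op for a discard hitting one object (sketch)."""
    if offset_in_object == 0 and length == OBJECT_SIZE:
        return "remove"    # whole object discarded -> delete it
    if offset_in_object + length == OBJECT_SIZE:
        return "truncate"  # discard reaches the object's end -> truncate
    return "zero"          # interior range -> overwrite with zeros

print(discard_op(0, OBJECT_SIZE))                    # remove
print(discard_op(2 * 1024 * 1024, 2 * 1024 * 1024))  # truncate
print(discard_op(0, 512))                            # zero
```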


Greets,
Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't start ceph mon

2012-11-19 Thread Gregory Farnum
On Mon, Nov 19, 2012 at 1:08 PM, Dave Humphreys (Datatone)
d...@datatone.co.uk wrote:

 I have a problem in which I can't start my ceph monitor. The log is shown 
 below.

 The log shows version 0.54. I was running 0.52 when the problem arose, and I 
 moved to the latest in case the newer version fixed the problem.

 The original failure happened a week or so ago, and could have been as a 
 result of running out of disk space when the ceph monitor log became huge.

That is almost certainly the case, although I thought we were handling
this issue better now.

 What should I do to recover the situation?

Do you have other monitors in working order? The easiest way to handle
it if that's the case is just to remove this monitor from the cluster
and add it back in as a new monitor with a fresh store. If not we can
look into reconstructing it.
-Greg



 David





 2012-11-19 20:38:51.598468 7fc13fdc6780  0 ceph version 0.54 
 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 21012
 2012-11-19 20:38:51.598482 7fc13fdc6780  1 store(/ceph/mon.vault01) mount
 [...]
 --- begin dump of recent events ---
 2012-11-19 20:38:51.600509 7fc13fdc6780 -1 *** Caught signal (Aborted) **
  in thread 7fc13fdc6780

  ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
  1: ceph-mon() [0x53adf8]
  2: (()+0xfe90) [0x7fc141830e90]
  3: (gsignal()+0x3e) [0x7fc140016dae]
  4: (abort()+0x17b) [0x7fc14001825b]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d]
  6: (()+0xb31b6) [0x7fc141af11b6]
  7: (()+0xb31e3) [0x7fc141af11e3]
  8: (()+0xb32de) [0x7fc141af12de]
  9: ceph-mon() [0x5ecb9f]
  10: (Paxos::get_stashed(ceph::buffer::list)+0x1ed) [0x49e28d]
  11: (Paxos::init()+0x109) [0x49e609]
  12: (Monitor::init()+0x36a) [0x485a4a]
  13: (main()+0x1289) [0x46d909]
  14: (__libc_start_main()+0xed) [0x7fc14000364d]
  15: ceph-mon() [0x46fa09]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.

-55 2012-11-19 20:38:51.596694 7fc13fdc6780  5 asok(0x213d000) 
 register_command perfcounters_dump hook 0x2131050
-55 2012-11-19 20:38:51.596720 

Re: Files lost after mds rebuild

2012-11-19 Thread Gregory Farnum
On Mon, Nov 19, 2012 at 7:55 AM, Drunkard Zhang gongfan...@gmail.com wrote:
 I created a ceph cluster for testing; here's the mistake I made:
 I added a second mds, mds.ab, executed 'ceph mds set_max_mds 2', then
 removed the mds just added;
 then, after 'ceph mds set_max_mds 1', the first mds.aa crashed and became laggy.
 As I couldn't repair mds.aa, I ran 'ceph mds newfs metadata data
 --yes-i-really-mean-it';

So this command is a mkfs sort of thing. It's deleted all the
allocation tables and filesystem metadata in favor of new, empty
ones. You should not run --yes-i-really-mean-it commands if you
don't know exactly what the command is doing and why you're using it.

 mds.aa was back, but 1TB of data in the cluster was lost, though the disk
 space is still shown as used by 'ceph -s'.

 Is there any chance I can get my data back? If not, how can I reclaim
 the disk space?

There's not currently a great way to get that data back. With
sufficient energy it could be re-constructed by looking through all
the RADOS objects and putting something together.
To retrieve the disk space, you'll want to delete the data and
metadata RADOS pools. This will of course *eliminate* the data you
have in your new filesystem, so grab that out first if there's
anything there you care about. Then create the pools and run the newfs
command again.
Also, you've got the syntax wrong on that newfs command. You should be
using pool IDs:
ceph mds newfs 1 0 --yes-i-really-mean-it
(Though these IDs may change after re-creating the pools.)
-Greg


 Now it looks like:
 log3 ~ # ceph -s
health HEALTH_OK
monmap e1: 1 mons at {log3=10.205.119.2:6789/0}, election epoch 0,
 quorum 0 log3
osdmap e1555: 28 osds: 20 up, 20 in
 pgmap v56518: 960 pgs: 960 active+clean; 1134 GB data, 2306 GB
 used, 51353 GB / 55890 GB avail
mdsmap e703: 1/1/1 up {0=aa=up:active}, 1 up:standby

 log3 ~ # df | grep osd |sort
 /dev/sdb1   2.8T  124G  2.5T   5% /ceph/osd.0
 /dev/sdc1   2.8T  104G  2.6T   4% /ceph/osd.1
 /dev/sdd1   2.8T   84G  2.6T   4% /ceph/osd.2
 /dev/sde1   2.8T  117G  2.6T   5% /ceph/osd.3
 /dev/sdf1   2.8T  105G  2.6T   4% /ceph/osd.4
 /dev/sdg1   2.8T   84G  2.6T   4% /ceph/osd.5
 /dev/sdh1   2.8T  140G  2.5T   6% /ceph/osd.6
 /dev/sdi1   2.8T  134G  2.5T   5% /ceph/osd.8
 /dev/sdj1   2.8T  112G  2.6T   5% /ceph/osd.7
 /dev/sdk1   2.8T  159G  2.5T   6% /ceph/osd.9
 /dev/sdl1   2.8T  126G  2.5T   5% /ceph/osd.10

 The osds on the other host didn't show up.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is the disk on MDS used for journal?

2012-11-19 Thread Gregory Farnum
On Sun, Nov 18, 2012 at 7:14 PM, liu yaqi liuyaqiy...@gmail.com wrote:
 Is the disk on the MDS used for the journal? Does it have some other use?

The MDS doesn't make any use of local disk space — it stores
everything in RADOS. You need enough local disk to provide a
configuration file, keyring, and debug logging (if you want those
things).
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD network failure

2012-11-19 Thread Gregory Farnum
On Fri, Nov 16, 2012 at 5:56 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 11/15/2012 01:51 AM, Gandalf Corvotempesta wrote:

 2012/11/15 Josh Durgin josh.dur...@inktank.com:

 So basically you'd only need a single nic per storage node. Multiple
 can be useful to separate frontend and backend traffic, but ceph
 is designed to maintain strong consistency when failures occur.


  Probably I've not explained this well.
  I'll have multiple NICs: one for the frontend, one for the backend used as
  the OSD sync network.
  What happens in case of a backend network failure? The frontend network
  is still OK, so the OSD is still reachable but is not able to sync data.


 Ah, ok. By default, the OSDs use the backend network for heartbeats,
 so if it fails, they will notice and report peers they can't reach as
 failed to the monitors, and the normal failure handling takes care
 of things.

 If you're worried about consistency, remember that a write won't
 complete until it's on disk on all replicas. If you're interested
 in the gory details of maintaining consistency, check out the peering
 process [1].

 Josh

 [1] http://ceph.com/docs/master/dev/peering/

Actually, right now a failed cluster and an up public network is
something the OSDs do not handle well — they will mark each other down
on the monitor and then tell the monitor hey, I'm not dead! and
start flapping pretty horrendously. We first ran across it a couple
weeks ago and have started to think about it, but I'm not sure a fix
for this is going to make it into the initial Bobtail release. :(
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Unused doc/images/.jpg files

2012-11-19 Thread Snider, Tim
Hi - There are several .jpg files in the doc/images directory of the tarball
that don't seem to be used in the HTML files or man pages after the docs are
built. If they are used somewhere, where is that, and what am I missing?
Some of the .png files are used.

root@84Server:~/ceph-ceph-fd4b839# ls doc/images/
AccessMethods.jpg  RADOS.jpg chef.png  lightstack.png  
radosStack.svg  techstack.png
CEPHConfig.jpg RBD.jpg   chef.svg  lightstack.svg  
stack.png   techstack.svg
CRUSH.jpg  RDBSnapshots.jpg  docreviewprocess.jpg  osdStack.svg
stack.svg

Server:~/ceph-ceph-fd4b839# grep -R osdStack.svg *
Server:~/ceph-ceph-fd4b839# grep -R techstack.png *
doc/images/techstack.svg:   
inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png
doc/images/radosStack.svg:   
inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png

Server:~/ceph-ceph-fd4b839# grep -R stack.png *
Binary file build-doc/doctrees/index.doctree matches
Binary file build-doc/doctrees/environment.pickle matches
build-doc/output/html/index.html:img alt=_images/stack.png 
src=_images/stack.png /
build-doc/output/html/_sources/index.txt:.. image:: images/stack.png
doc/index.rst:.. image:: images/stack.png
doc/images/techstack.svg:   
inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png
doc/images/radosStack.svg:   
inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png
doc/images/stack.svg:   
inkscape:export-filename=/home/johnw/ceph/doc/images/stack.png
doc/images/lightstack.svg:   
inkscape:export-filename=/home/johnw/ceph/doc/images/lightstack.png

/tmp/ceph-ceph-fd4b839
Server:~/ceph-ceph-fd4b839# find . -name *.jpg -print
./doc/images/RADOS.jpg
./doc/images/CRUSH.jpg
./doc/images/AccessMethods.jpg
./doc/images/docreviewprocess.jpg
./doc/images/CEPHConfig.jpg
./doc/images/RDBSnapshots.jpg
./doc/images/RBD.jpg

Server:~/ceph-ceph-fd4b839# grep -R AccessMethods *
Server:~/ceph-ceph-fd4b839# grep -R CEPHConfig.jpg *
Server:~/ceph-ceph-fd4b839# grep -R RBD.jpg *
Server:~/ceph-ceph-fd4b839# grep -R RADOS.jpg *
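The per-file greps above can be automated. A small sketch (the helper name is ours) that reports any file under doc/images whose basename appears in no other file in the tree:

```python
import os

def find_unused_images(root):
    """List files in root/doc/images whose basename is referenced nowhere
    else in the tree (automates the manual greps above; sketch only)."""
    img_dir = os.path.join(root, "doc", "images")
    unused = []
    for name in sorted(os.listdir(img_dir)):
        referenced = False
        for dirpath, dirnames, filenames in os.walk(root):
            # Skip the images directory itself: a file trivially
            # "contains" its own name in listings.
            if os.path.abspath(dirpath) == os.path.abspath(img_dir):
                continue
            for fn in filenames:
                try:
                    with open(os.path.join(dirpath, fn), "rb") as f:
                        if name.encode() in f.read():
                            referenced = True
                            break
                except OSError:
                    continue
            if referenced:
                break
        if not referenced:
            unused.append(name)
    return unused

# Run from the top of the source tree, e.g.:
if os.path.isdir("doc/images"):
    for name in find_unused_images("."):
        print("unused:", name)
```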

Thanks,
Tim
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: deprecating mkcephfs (the arrival of light-weight deployment tools)

2012-11-19 Thread Sage Weil
On Mon, 19 Nov 2012, Isaac Otsiabah wrote:
 
 I am trying to understand the ceph deployment direction, because this link
 http://ceph.com/docs/master/rados/deployment/
 mentions that mkcephfs is deprecated. It also has the statement below,
 which mentions light-weight deployment scripts to help you evaluate Ceph.
 
 
 We provide light-weight deployment scripts to help you evaluate Ceph. For
 professional deployment, you should consider professional deployment systems
 such as Juju, Puppet, Chef or Crowbar.
 
 I think there is a need to have native ceph deployment tools that aren't
 dependent upon any third-party deployment tools. So my questions are these:
 
     1. When will the light-weight deployment scripts be available, and in
 which ceph version will they be released?

http://github.com/ceph/ceph-deploy is available for initial testing, but 
far from ready for widespread use.  mkcephfs is still the preferred 
installation path.

I'll make sure the 'deprecated' notation is removed until a real 
alternative is ready.

     2. Going forward, when will mkcephfs stop working (as of what
 ceph version)?

It will be maintained at least through cuttlefish (the next stable 
release), though probably longer, so that there is plenty of overlap with 
whatever tool will follow.

sage

Re: some snapshot problems

2012-11-19 Thread Gregory Farnum
On Sun, Nov 11, 2012 at 11:02 PM, liu yaqi liuyaqiy...@gmail.com wrote:
 2012/11/9 Sage Weil s...@inktank.com

 Lots of different snapshots:

  - librados lets you do 'selfmanaged snaps' in its API, which let an
application control which snapshots apply to which objects.
  - you can create a 'pool' snapshot on an entire librados pool.  this
cannot be used at the same time as rbd, fs, or the above 'selfmanaged'
snaps.
  - rbd lets you snapshot block device images (by using the librados
selfmanaged snap API).
  - the ceph file system lets you snapshot any subdirectory (again
utilizing the underlying RADOS functionality).

 I am confused about the concept of pool and image. Is one pool a
 set of placement groups? When I snap an image, does it mean a snapshot
 of one disk?

A pool is a logical namespace into which you place objects. Placement
groups are shards of a pool.
Snapping an image makes use of the self-managed snapshot
infrastructure, and takes a snapshot of one RBD volume (so yes, if
that's what you meant by a snapshot of one disk).
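To make "shards of a pool" concrete: every object is deterministically hashed into exactly one placement group, and the PG, not the individual object, is what Ceph places on OSDs and recovers as a unit. A deliberately simplified sketch of the mapping (Ceph really uses the rjenkins hash and a "stable mod", not CRC32, so the PG ids below are illustrative only):

```python
import zlib

def object_to_pg(pool_id, object_name, pg_num):
    """Simplified object -> placement-group mapping (sketch only)."""
    # Hash the object name and fold it into one of pg_num shards.
    h = zlib.crc32(object_name.encode())
    return "{}.{:x}".format(pool_id, h % pg_num)

# The same name always lands in the same PG of the same pool.
print(object_to_pg(3, "rb.0.1044.359ed6c7.0bde", 960))
```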

 I think a snapshot is used to preserve the state of a directory at one
 time, and I wonder if there could be a situation where I preserve the
 data of the directory but not its metadata (maybe because the metadata
 and data are not in the same pool). Could this happen?

The Ceph filesystem builds a bit more on top of the RADOS snapshots —
metadata and data are almost never in the same pool, and the metadata
snapshots don't use RADOS snapshots anyway.

 When snapshotting a directory, I traced the code in the mds: snapinfo is
 added to the inode, but where and when is the content of the snap created?
 What is the data structure of the snap content? When the client sets an
 inode attribute, it returns if snapid == NOSNAP; does this mean that once
 the inode has been snapped, it cannot be changed? So the snap is not using
 copy-on-write (create the snap, then change the content of the snap file
 when setting an inode attribute or writing the file)? If not copy-on-write,
 what's the snap workflow for a directory?

You want to look into the code surrounding SnapRealms to see how the
metadata for snapshots is managed.

 There are multiple metadata nodes, and one directory may be spread over
 multiple servers, each holding part of the dir. How does ceph resolve this?
 This also causes a clock problem.

It's not easy, but again, look at how the SnapRealms are dealt with.
The MDSes will do synchronous notifications to each other.
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd map command hangs for 15 minutes during system start up

2012-11-19 Thread Nick Bartos
Making 'mon clock drift allowed' very small (0.1) does not
reliably reproduce the hang.  I started looking at the code for 0.48.2
and it looks like this is only used in Paxos::warn_on_future_time,
which only handles the warning, nothing else.
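For anyone reproducing this, both knobs discussed in this thread live in the [mon] section of ceph.conf. A sketch of the relevant fragment, with the values cited here as a working assumption (the 50 ms warning threshold and the 5 s lease interval):

```ini
[mon]
    ; Clock skew (seconds) between monitors before warnings are logged.
    ; Per Sage below, up to 0.5 is safe as long as it stays well under
    ; the lease interval.
    mon clock drift allowed = 0.05
    ; Paxos lease interval (seconds).
    mon lease = 5
```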


On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil s...@inktank.com wrote:
 On Fri, 16 Nov 2012, Nick Bartos wrote:
 Should I be lowering the clock drift allowed, or the lease interval to
 help reproduce it?

 clock drift allowed.




 On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil s...@inktank.com wrote:
  You can safely set the clock drift allowed as high as 500ms.  The real
  limitation is that it needs to be well under the lease interval, which is
  currently 5 seconds by default.
 
  You might be able to reproduce more easily by lowering the threshold...
 
  sage
 
 
  On Fri, 16 Nov 2012, Nick Bartos wrote:
 
  How far off do the clocks need to be before there is a problem?  It
  would seem to be hard to ensure a very large cluster has all of its
  nodes synchronized within 50ms (which seems to be the default for mon
  clock drift allowed).  Does the mon clock drift allowed parameter
  change anything other than the log messages?  Are there any other
  tuning options that may help, assuming that this is the issue and it's
  not feasible to get the clocks more than 500ms in sync between all
  nodes?
 
  I'm trying to get a good way of reproducing this and get a trace on
  the ceph processes to see what they're waiting on.  I'll let you know
  when I have more info.
 
 
  On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil s...@inktank.com wrote:
   I just realized I was mixing up this thread with the other deadlock
   thread.
  
   On Fri, 16 Nov 2012, Nick Bartos wrote:
   Turns out we're having the 'rbd map' hang on startup again, after we
   started using the wip-3.5 patch set.  How critical is the
   libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
   removed before which seemed to get rid of the problem (although I'm
   not completely sure if it completely got rid of it, at least seemed to
   happen much less often).
  
   It seems like we only started having this issue after we started
   patching the 3.5 ceph client (we started patching to try and get rid
   of a kernel oops, which the patches seem to have fixed).
  
   Right.  That patch fixes a real bug.  It also seems pretty unlikely that
   this patch is related to the startup hang.  The original log showed 
   clock
   drift on the monitor that could very easily cause this sort of hang.  
   Can
   you confirm that that isn't the case with this recent instance of the
   problem?  And/or attach a log?
  
   Thanks-
   sage
  
  
  
  
   On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil s...@inktank.com wrote:
On Thu, 15 Nov 2012, Nick Bartos wrote:
Sorry I guess this e-mail got missed.  I believe those patches came
from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
branch patches, which seem to all be fine.  We'll stick with 3.5 and
this backport for now until we can figure out what's wrong with 3.6.
   
I typically ignore the wip branches just due to the naming when I'm
looking for updates.  Where should I typically look for updates that
aren't in released kernels?  Also, is there anything else in the 
wip*
branches that you think we may find particularly useful?
   
 You were looking in the right place.  The problem was we weren't super
 organized with our stable patches, and changed our minds about what to
 send upstream.  These are 'wip' in the sense that they were in preparation
 for going upstream.  The goal is to push them to the mainline stable
 kernels and ideally not keep them in our tree at all.
   
wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
we're keeping it so that ubuntu can pick it up for quantal.
   
I'll make sure these are more clearly marked as stable.
   
sage
   
   
   
   
On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil s...@inktank.com wrote:
 On Mon, 12 Nov 2012, Nick Bartos wrote:
  After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
  seems we no longer have this hang.
 
  Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
  stable series?  I recently prepared a new one that backports *all* of the
  fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
  be curious if you see problems with that.

  So far, with these fixes in place, we have not seen any unexplained
  kernel crashes in this code.
 
  I take it you're going back to a 3.5 kernel because you weren't able to
  get rid of the sync problem with 3.6?

 sage




  On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin josh.dur...@inktank.com wrote:
  On 11/08/2012 02:10 PM, Mandell Degerness wrote:
 
  We are seeing a somewhat 

Re: librbd discard bug problems - i got it

2012-11-19 Thread Stefan Priebe
Hmm, the qemu rbd block driver always gets these errors back. As 
rbd_aio_bh_cb is called directly from librbd, the problem must be there. 
Strangely, I can't find where rbd_aio_bh_cb gets called with -512.

Any further ideas?

rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -1006628352 Error: 0

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd recovery extremely slow with current master

2012-11-19 Thread Gregory Farnum
Which version was this on? There was some fairly significant work done
on recovery to introduce a reservation scheme and some other things
that might need different defaults.
-Greg

On Tue, Nov 13, 2012 at 12:33 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi list,

 osd recovery seems to be really slow with current master.

 I see only 1-8 active+recovering out of 1200, even though there's no load
 on the ceph cluster.

 Greets,
 Stefan


objectcacher lru eviction causes assert

2012-11-19 Thread Sam Lang


Hi All,

We've been fixing a number of objectcacher bugs to handle races between 
slow osd commit replies and various other operations like truncate.  I 
ran into another problem earlier today with a race between an object 
getting evicted from the lru cache (via readx -> trim) and the osd 
commit reply.  The assertion trace is below.


We've avoided keeping a reference to the object during the commit, but 
that means the object isn't pinned in the lru, and so can come up 
for eviction.  When it gets evicted, we close the object and hit the 
assertion; we can't close it there, because we need the object to finish 
the commit.


I've pushed a change that needs review in the wip-3431 branch.  It 
allows the object to be evicted from the lru cache, but checks that 
it can be closed (as we do elsewhere); if not, it lets the commit 
handle the close (via flush...release).


The assertion we hit is:

2012-11-19 09:06:35.187910 7ff143e2f780 1 osdc/ObjectCacher.cc: In 
function 'void ObjectCacher::close_object(ObjectCacher::Object*)' thread 
7ff143e2f780 time 2012-11-19 09:06:35.186379

osdc/ObjectCacher.cc: 577: FAILED assert(ob->can_close())
ceph version 0.54-641-g4c69f86 (4c69f865ca79328c62635ae32c91bd32b3985613)
 1: (ObjectCacher::close_object(ObjectCacher::Object*)+0x135) [0x5c78d5]
 2: (ObjectCacher::trim(long, long)+0x820) [0x5c94d0]
 3: (ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool)+0x21ad) [0x5d92dd]
 4: (Client::_read_async(Fh*, unsigned long, unsigned long, 
ceph::buffer::list*)+0x3e9) [0x486c09]
 5: (Client::_read(Fh*, long, unsigned long, 
ceph::buffer::list*)+0x265) [0x49bd65]

 6: (Client::ll_read(Fh*, long, long, ceph::buffer::list*)+0x97) [0x49be87]
 7: /tmp/cephtest/binary/usr/local/bin/ceph-fuse() [0x4733cf]
 8: (()+0x12d5e) [0x7ff1439fdd5e]
 9: (fuse_session_loop()+0x75) [0x7ff1439fbd65]
 10: (ceph_fuse_ll_main(Client*, int, char const**, int)+0x225) [0x474245]
 11: (main()+0x42f) [0x4716ef]
 12: (__libc_start_main()+0xed) [0x7ff141ebd76d]
 13: /tmp/cephtest/binary/usr/local/bin/ceph-fuse() [0x472e95]


Re: Removed directory is back in the Ceph FS

2012-11-19 Thread Gregory Farnum
On Tue, Nov 13, 2012 at 3:23 AM, Franck Marchand fmarch...@agaetis.fr wrote:
 Hi,

 I have a weird problem. I removed a folder using a mounted fs partition,
 and it worked well.

What client are you using? How did you delete it? (rm -rf, etc?) Are
you using multiple clients or one, and did you check it on a different
client?

 I checked later to see if I had all my folders in ceph fs ... : the
 folder I removed was back and I can't remove it ! Here is the error
 message I got :

 rm -rf 2012-11-10/
 rm: cannot remove `2012-11-10': Directory not empty

 This folder is empty ...
 Has anybody had the same problem? Am I doing something wrong?

This sounds like a known but undiagnosed problem with the MDS
rstats. The part where your client reported success is a new
wrinkle, though.
-Greg



 Thx


Re: librbd discard bug problems - i got it

2012-11-19 Thread Josh Durgin

On 11/19/2012 03:16 PM, Stefan Priebe wrote:

Hmm, the qemu rbd block driver always gets these errors back. As
rbd_aio_bh_cb is called directly from librbd, the problem must be there.
Strangely, I can't find where rbd_aio_bh_cb gets called with -512.

Any further ideas?


Two ideas:

1) Is rbd_finish_aiocb getting this same return value?

2) Perhaps it's an issue with the return value wrapping around with
very large discards. Adding some logging of the return values of each
rados operation in AioCompletion::complete_request() might give us a
clue. These large negative return values are suspicious.


rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -1006628352 Error: 0

Stefan




Re: ceph-osd crashing (os/FileStore.cc: 4500: FAILED assert(replaying))

2012-11-19 Thread Stefan Priebe

Am 20.11.2012 00:39, schrieb Samuel Just:

Seems to be a truncated log file...  That usually indicates filesystem
corruption.  Anything in dmesg?
-Sam

No. Everything fine.



On Thu, Nov 15, 2012 at 1:07 PM, Stefan Priebe s.pri...@profihost.ag wrote:

Hello list,

actual master incl. upstream/wip-fd-simple-cache results in this crash when
i try to start some of my osds (others work fine) today on multiple nodes:

 -2 2012-11-15 22:04:09.226945 7f3af1c7a780  0 osd.52 pg_epoch: 657
pg[3.3b( v 632'823 (632'823,632'823] n=5 ec=17 les/c 18/18 656/656/17) []
r=0 lpr=0 pi=17-655/2 (info mismatch, log(632'823,0'0]) (log bound mismatch,
empty) lcod 0'0 mlcod 0'0 inactive] Got exception 'read_log_error: read_log
got 0 bytes, expected 126086-0=126086' while reading log. Moving corrupted
log file to 'corrupt_log_2012-11-15_22:04_3.3b' for later analysis.
 -1 2012-11-15 22:04:09.233563 7f3af1c7a780  0 osd.52 pg_epoch: 657
pg[3.557( v 632'753 (0'0,632'753] n=2 ec=17 les/c 18/18 656/656/17) [] r=0
lpr=0 pi=17-655/2 (info mismatch, log(0'0,0'0]) lcod 0'0 mlcod 0'0 inactive]
Got exception 'read_log_error: read_log got 0 bytes, expected
115488-0=115488' while reading log. Moving corrupted log file to
'corrupt_log_2012-11-15_22:04_3.557' for later analysis.
  0 2012-11-15 22:04:09.234536 7f3ae87d0700 -1 os/FileStore.cc: In
function 'int FileStore::_collection_add(coll_t, coll_t, const hobject_t&,
const SequencerPosition&)' thread 7f3ae87d0700 time 2012-11-15
22:04:09.233672
os/FileStore.cc: 4500: FAILED assert(replaying)

  ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
  1: (FileStore::_collection_add(coll_t, coll_t, hobject_t const&,
SequencerPosition const&)+0x77d) [0x72ff0d]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long,
int)+0x25fb) [0x73481b]
  3: (FileStore::do_transactions(std::list<ObjectStore::Transaction*,
std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c)
[0x73952c]
  4: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
  7: (()+0x68ca) [0x7f3af16578ca]
  8: (clone()+0x6d) [0x7f3aefac6bfd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
interpret this.

--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 0 journaler
0/ 5 objectcacher
0/ 5 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
1/ 5 mon
0/ 0 monc
0/ 5 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent 1
   max_new  100
   log_file /var/log/ceph/ceph-osd.52.log
--- end dump of recent events ---
2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal (Aborted) **
  in thread 7f3ae87d0700

  ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
  1: /usr/bin/ceph-osd() [0x799769]
  2: (()+0xeff0) [0x7f3af165fff0]
  3: (gsignal()+0x35) [0x7f3aefa29215]
  4: (abort()+0x180) [0x7f3aefa2c020]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5]
  6: (()+0xcb166) [0x7f3af02bc166]
  7: (()+0xcb193) [0x7f3af02bc193]
  8: (()+0xcb28e) [0x7f3af02bc28e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x7fd069]
  10: (FileStore::_collection_add(coll_t, coll_t, hobject_t const&,
SequencerPosition const&)+0x77d) [0x72ff0d]
  11: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long,
int)+0x25fb) [0x73481b]
  12: (FileStore::do_transactions(std::list<ObjectStore::Transaction*,
std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c)
[0x73952c]
  13: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
  14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
  15: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
  16: (()+0x68ca) [0x7f3af16578ca]
  17: (clone()+0x6d) [0x7f3aefac6bfd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
interpret this.

--- begin dump of recent events ---
  0 2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal
(Aborted) **
  in thread 7f3ae87d0700

  ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
  1: /usr/bin/ceph-osd() [0x799769]
  2: (()+0xeff0) [0x7f3af165fff0]
  3: (gsignal()+0x35) [0x7f3aefa29215]
  4: (abort()+0x180) [0x7f3aefa2c020]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5]
  6: (()+0xcb166) [0x7f3af02bc166]
  7: (()+0xcb193) 

Re: librbd discard bug problems - i got it

2012-11-19 Thread Stefan Priebe

Am 20.11.2012 00:33, schrieb Josh Durgin:

On 11/19/2012 03:16 PM, Stefan Priebe wrote:

Hmm, the qemu rbd block driver always gets these errors back. As
rbd_aio_bh_cb is called directly from librbd, the problem must be there.
Strangely, I can't find where rbd_aio_bh_cb gets called with -512.

Any further ideas?


Two ideas:

1) Is rbd_finish_aiocb getting this same return value?

Will check this tomorrow.



2) Perhaps it's an issue with the return value wrapping around with
very large discards. Adding some logging of the return values of each
rados operation in AioCompletion::complete_request() might give us a
clue. These large negative return values are suspicious.


Good idea. As r and rval are int, they are limited. But 
AioCompletion::complete_request keeps adding more and more to rval. 
What could be a solution? Bump rval to int64? Or wrap around to start 
at 0 again?


Stefan


[PATCH, v2] rbd: do not allow remove of mounted-on image

2012-11-19 Thread Alex Elder
There is no check in rbd_remove() to see if anybody holds open the
image being removed.  That's not cool.

Add a simple open count that goes up and down with opens and closes
(releases) of the device, and don't allow an rbd image to be removed
if the count is non-zero.

Protect the updates of the open count value with ctl_mutex to ensure
the underlying rbd device doesn't get removed while concurrently
being opened.

Signed-off-by: Alex Elder el...@inktank.com
---
v2: added ctl_mutex locking for rbd_open() and rbd_release()

 drivers/block/rbd.c |   13 +
 1 file changed, 13 insertions(+)

Index: b/drivers/block/rbd.c
===
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -255,6 +255,7 @@ struct rbd_device {

/* sysfs related */
struct device   dev;
+   unsigned long   open_count;
 };

 static DEFINE_MUTEX(ctl_mutex);  /* Serialize open/close/setup/teardown */
@@ -356,8 +357,11 @@ static int rbd_open(struct block_device
if ((mode & FMODE_WRITE) && rbd_dev->mapping.read_only)
return -EROFS;
 
+   mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
rbd_get_dev(rbd_dev);
set_device_ro(bdev, rbd_dev->mapping.read_only);
+   rbd_dev->open_count++;
+   mutex_unlock(&ctl_mutex);

return 0;
 }
@@ -366,7 +370,11 @@ static int rbd_release(struct gendisk *d
 {
struct rbd_device *rbd_dev = disk->private_data;
 
+   mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
+   rbd_assert(rbd_dev->open_count > 0);
+   rbd_dev->open_count--;
rbd_put_dev(rbd_dev);
+   mutex_unlock(&ctl_mutex);

return 0;
 }
@@ -3764,6 +3772,11 @@ static ssize_t rbd_remove(struct bus_typ
goto done;
}

+   if (rbd_dev->open_count) {
+   ret = -EBUSY;
+   goto done;
+   }
+
rbd_remove_all_snaps(rbd_dev);
rbd_bus_del_dev(rbd_dev);



Re: ceph-osd crashing (os/FileStore.cc: 4500: FAILED assert(replaying))

2012-11-19 Thread Samuel Just
Can you restart one of the affected osds with debug osd = 20, debug
filestore = 20, debug ms = 1 and post the log?
-Sam

On Mon, Nov 19, 2012 at 3:39 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Am 20.11.2012 00:39, schrieb Samuel Just:

 Seems to be a truncated log file...  That usually indicates filesystem
 corruption.  Anything in dmesg?
 -Sam

 No. Everything fine.



 On Thu, Nov 15, 2012 at 1:07 PM, Stefan Priebe s.pri...@profihost.ag
 wrote:

 Hello list,

 actual master incl. upstream/wip-fd-simple-cache results in this crash
 when
 i try to start some of my osds (others work fine) today on multiple
 nodes:

  -2 2012-11-15 22:04:09.226945 7f3af1c7a780  0 osd.52 pg_epoch: 657
 pg[3.3b( v 632'823 (632'823,632'823] n=5 ec=17 les/c 18/18 656/656/17) []
 r=0 lpr=0 pi=17-655/2 (info mismatch, log(632'823,0'0]) (log bound
 mismatch,
 empty) lcod 0'0 mlcod 0'0 inactive] Got exception 'read_log_error:
 read_log
 got 0 bytes, expected 126086-0=126086' while reading log. Moving
 corrupted
 log file to 'corrupt_log_2012-11-15_22:04_3.3b' for later analysis.
  -1 2012-11-15 22:04:09.233563 7f3af1c7a780  0 osd.52 pg_epoch: 657
 pg[3.557( v 632'753 (0'0,632'753] n=2 ec=17 les/c 18/18 656/656/17) []
 r=0
 lpr=0 pi=17-655/2 (info mismatch, log(0'0,0'0]) lcod 0'0 mlcod 0'0
 inactive]
 Got exception 'read_log_error: read_log got 0 bytes, expected
 115488-0=115488' while reading log. Moving corrupted log file to
 'corrupt_log_2012-11-15_22:04_3.557' for later analysis.
   0 2012-11-15 22:04:09.234536 7f3ae87d0700 -1 os/FileStore.cc: In
 function 'int FileStore::_collection_add(coll_t, coll_t, const
 hobject_t,
 const SequencerPosition)' thread 7f3ae87d0700 time 2012-11-15
 22:04:09.233672
 os/FileStore.cc: 4500: FAILED assert(replaying)

   ceph version 0.54-607-gf89e101
 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
   1: (FileStore::_collection_add(coll_t, coll_t, hobject_t const,
 SequencerPosition const)+0x77d) [0x72ff0d]
   2: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
 long,
 int)+0x25fb) [0x73481b]
   3: (FileStore::do_transactions(std::listObjectStore::Transaction*,
 std::allocatorObjectStore::Transaction* , unsigned long)+0x4c)
 [0x73952c]
   4: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
   5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
   6: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
   7: (()+0x68ca) [0x7f3af16578ca]
   8: (clone()+0x6d) [0x7f3aefac6bfd]
   NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to
 interpret this.

 --- logging levels ---
 0/ 5 none
 0/ 0 lockdep
 0/ 0 context
 0/ 0 crush
 1/ 5 mds
 1/ 5 mds_balancer
 1/ 5 mds_locker
 1/ 5 mds_log
 1/ 5 mds_log_expire
 1/ 5 mds_migrator
 0/ 0 buffer
 0/ 0 timer
 0/ 1 filer
 0/ 1 striper
 0/ 1 objecter
 0/ 5 rados
 0/ 5 rbd
 0/ 0 journaler
 0/ 5 objectcacher
 0/ 5 client
 0/ 0 osd
 0/ 0 optracker
 0/ 0 objclass
 0/ 0 filestore
 0/ 0 journal
 0/ 0 ms
 1/ 5 mon
 0/ 0 monc
 0/ 5 paxos
 0/ 0 tp
 0/ 0 auth
 1/ 5 crypto
 0/ 0 finisher
 0/ 0 heartbeatmap
 0/ 0 perfcounter
 1/ 5 rgw
 1/ 5 hadoop
 1/ 5 javaclient
 0/ 0 asok
 0/ 0 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 1
max_new  100
log_file /var/log/ceph/ceph-osd.52.log
 --- end dump of recent events ---
 2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal (Aborted) **
   in thread 7f3ae87d0700

   ceph version 0.54-607-gf89e101
 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
   1: /usr/bin/ceph-osd() [0x799769]
   2: (()+0xeff0) [0x7f3af165fff0]
   3: (gsignal()+0x35) [0x7f3aefa29215]
   4: (abort()+0x180) [0x7f3aefa2c020]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5]
   6: (()+0xcb166) [0x7f3af02bc166]
   7: (()+0xcb193) [0x7f3af02bc193]
   8: (()+0xcb28e) [0x7f3af02bc28e]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x7c9) [0x7fd069]
   10: (FileStore::_collection_add(coll_t, coll_t, hobject_t const,
 SequencerPosition const)+0x77d) [0x72ff0d]
   11: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
 long,
 int)+0x25fb) [0x73481b]
   12: (FileStore::do_transactions(std::listObjectStore::Transaction*,
 std::allocatorObjectStore::Transaction* , unsigned long)+0x4c)
 [0x73952c]
   13: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
   14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
   15: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
   16: (()+0x68ca) [0x7f3af16578ca]
   17: (clone()+0x6d) [0x7f3aefac6bfd]
   NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to
 interpret this.

 --- begin dump of recent events ---
   0 2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal
 (Aborted) **
   in thread 7f3ae87d0700

   ceph version 

Re: ceph-osd crashing (os/FileStore.cc: 4500: FAILED assert(replaying))

2012-11-19 Thread Stefan Priebe
I've formatted the cluster since then. But i'll report back if this 
happens again.


Stefan
Am 20.11.2012 00:43, schrieb Samuel Just:

Can you restart one of the affected osds with debug osd = 20, debug
filestore = 20, debug ms = 1 and post the log?
-Sam

On Mon, Nov 19, 2012 at 3:39 PM, Stefan Priebe s.pri...@profihost.ag wrote:

Am 20.11.2012 00:39, schrieb Samuel Just:


Seems to be a truncated log file...  That usually indicates filesystem
corruption.  Anything in dmesg?
-Sam


No. Everything fine.




On Thu, Nov 15, 2012 at 1:07 PM, Stefan Priebe s.pri...@profihost.ag
wrote:


Hello list,

actual master incl. upstream/wip-fd-simple-cache results in this crash
when
i try to start some of my osds (others work fine) today on multiple
nodes:

  -2 2012-11-15 22:04:09.226945 7f3af1c7a780  0 osd.52 pg_epoch: 657
pg[3.3b( v 632'823 (632'823,632'823] n=5 ec=17 les/c 18/18 656/656/17) []
r=0 lpr=0 pi=17-655/2 (info mismatch, log(632'823,0'0]) (log bound
mismatch,
empty) lcod 0'0 mlcod 0'0 inactive] Got exception 'read_log_error:
read_log
got 0 bytes, expected 126086-0=126086' while reading log. Moving
corrupted
log file to 'corrupt_log_2012-11-15_22:04_3.3b' for later analysis.
  -1 2012-11-15 22:04:09.233563 7f3af1c7a780  0 osd.52 pg_epoch: 657
pg[3.557( v 632'753 (0'0,632'753] n=2 ec=17 les/c 18/18 656/656/17) []
r=0
lpr=0 pi=17-655/2 (info mismatch, log(0'0,0'0]) lcod 0'0 mlcod 0'0
inactive]
Got exception 'read_log_error: read_log got 0 bytes, expected
115488-0=115488' while reading log. Moving corrupted log file to
'corrupt_log_2012-11-15_22:04_3.557' for later analysis.
   0 2012-11-15 22:04:09.234536 7f3ae87d0700 -1 os/FileStore.cc: In
function 'int FileStore::_collection_add(coll_t, coll_t, const
hobject_t,
const SequencerPosition)' thread 7f3ae87d0700 time 2012-11-15
22:04:09.233672
os/FileStore.cc: 4500: FAILED assert(replaying)

   ceph version 0.54-607-gf89e101
(f89e1012bafabd6875a4a1e1832d76ffdf45b039)
   1: (FileStore::_collection_add(coll_t, coll_t, hobject_t const,
SequencerPosition const)+0x77d) [0x72ff0d]
   2: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
long,
int)+0x25fb) [0x73481b]
   3: (FileStore::do_transactions(std::listObjectStore::Transaction*,
std::allocatorObjectStore::Transaction* , unsigned long)+0x4c)
[0x73952c]
   4: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
   5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
   6: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
   7: (()+0x68ca) [0x7f3af16578ca]
   8: (clone()+0x6d) [0x7f3aefac6bfd]
   NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to
interpret this.

--- logging levels ---
 0/ 5 none
 0/ 0 lockdep
 0/ 0 context
 0/ 0 crush
 1/ 5 mds
 1/ 5 mds_balancer
 1/ 5 mds_locker
 1/ 5 mds_log
 1/ 5 mds_log_expire
 1/ 5 mds_migrator
 0/ 0 buffer
 0/ 0 timer
 0/ 1 filer
 0/ 1 striper
 0/ 1 objecter
 0/ 5 rados
 0/ 5 rbd
 0/ 0 journaler
 0/ 5 objectcacher
 0/ 5 client
 0/ 0 osd
 0/ 0 optracker
 0/ 0 objclass
 0/ 0 filestore
 0/ 0 journal
 0/ 0 ms
 1/ 5 mon
 0/ 0 monc
 0/ 5 paxos
 0/ 0 tp
 0/ 0 auth
 1/ 5 crypto
 0/ 0 finisher
 0/ 0 heartbeatmap
 0/ 0 perfcounter
 1/ 5 rgw
 1/ 5 hadoop
 1/ 5 javaclient
 0/ 0 asok
 0/ 0 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 1
max_new  100
log_file /var/log/ceph/ceph-osd.52.log
--- end dump of recent events ---
2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal (Aborted) **
   in thread 7f3ae87d0700

   ceph version 0.54-607-gf89e101
(f89e1012bafabd6875a4a1e1832d76ffdf45b039)
   1: /usr/bin/ceph-osd() [0x799769]
   2: (()+0xeff0) [0x7f3af165fff0]
   3: (gsignal()+0x35) [0x7f3aefa29215]
   4: (abort()+0x180) [0x7f3aefa2c020]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5]
   6: (()+0xcb166) [0x7f3af02bc166]
   7: (()+0xcb193) [0x7f3af02bc193]
   8: (()+0xcb28e) [0x7f3af02bc28e]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x7fd069]
   10: (FileStore::_collection_add(coll_t, coll_t, hobject_t const,
SequencerPosition const)+0x77d) [0x72ff0d]
   11: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
long,
int)+0x25fb) [0x73481b]
   12: (FileStore::do_transactions(std::listObjectStore::Transaction*,
std::allocatorObjectStore::Transaction* , unsigned long)+0x4c)
[0x73952c]
   13: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
   14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
   15: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
   16: (()+0x68ca) [0x7f3af16578ca]
   17: (clone()+0x6d) [0x7f3aefac6bfd]
   NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to
interpret this.

--- begin dump of recent events ---
   0 2012-11-15 22:04:09.235734 7f3ae87d0700 

Re: librbd discard bug problems - i got it

2012-11-19 Thread Josh Durgin

On 11/19/2012 03:42 PM, Stefan Priebe wrote:

Am 20.11.2012 00:33, schrieb Josh Durgin:

On 11/19/2012 03:16 PM, Stefan Priebe wrote:

Hmm, the qemu rbd block driver always gets these errors back. As
rbd_aio_bh_cb is called directly from librbd, the problem must be there.
Strangely, I can't find where rbd_aio_bh_cb gets called with -512.

Any further ideas?


Two ideas:

1) Is rbd_finish_aiocb getting this same return value?

Will check this tomorrow.



2) Perhaps it's an issue with the return value wrapping around with
very large discards. Adding some logging of the return values of each
rados operation in AioCompletion::complete_request() might give us a
clue. These large negative return values are suspicious.


Good idea. As r and rval are int, they are limited. But
AioCompletion::complete_request keeps adding more and more to rval.
What could be a solution? Bump rval to int64? Or wrap around to start
at 0 again?


The final return value is limited to int at a few levels. Probably it's
best to make discard always return 0 on success. aio_discard should
already be doing this, but perhaps it's not in this case.

Josh


Re: librbd discard bug problems - i got it

2012-11-19 Thread Stefan Priebe

Hi Josh,

I don't get it. Every debug line I print shows a positive, fine value, but 
rbd_aio_bh_cb gets called with these values. As you can see there are 
not many such values; I copied all values < 0 from the log for discarding 
a whole 30GB device.


Stefan

Am 20.11.2012 00:47, schrieb Josh Durgin:

On 11/19/2012 03:42 PM, Stefan Priebe wrote:

Am 20.11.2012 00:33, schrieb Josh Durgin:

On 11/19/2012 03:16 PM, Stefan Priebe wrote:

Hmm, the qemu rbd block driver always gets these errors back. As
rbd_aio_bh_cb is called directly from librbd, the problem must be there.
Strangely, I can't find where rbd_aio_bh_cb gets called with -512.

Any further ideas?


Two ideas:

1) Is rbd_finish_aiocb getting this same return value?

Will check this tomorrow.



2) Perhaps it's an issue with the return value wrapping around with
very large discards. Adding some logging of the return values of each
rados operation in AioCompletion::complete_request() might give us a
clue. These large negative return values are suspicious.


Good idea. As r and rval are int, they are limited. But
AioCompletion::complete_request keeps adding more and more to rval.
What could be a solution? Bump rval to int64? Or wrap around to start
at 0 again?


The final return value is limited to int at a few levels. Probably it's
best to make discard always return 0 on success. aio_discard should
already be doing this, but perhaps it's not in this case.

Josh


Re: librbd discard bug problems - i got it

2012-11-19 Thread Josh Durgin

On 11/19/2012 04:00 PM, Stefan Priebe wrote:

Hi Josh,

I don't get it. Every debug line I print shows a positive, fine value, but
rbd_aio_bh_cb gets called with these values. As you can see there are
not many such values; I copied all values < 0 from the log for discarding
a whole 30GB device.


Could you post the patch of the debug prints you added and the log?



Re: libcephfs create file with layout and replication

2012-11-19 Thread Gregory Farnum
On Sun, Nov 18, 2012 at 12:05 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
 Wanna have a look at a first pass on this patch?

wip-client-open-layout

 Thanks,
 Noah

Just glanced over this, and I'm curious:
1) Why symlink another reference to your file_layout.h?
2) There's already a ceph_file_layout struct which is used widely
(MDS, kernel, userspace client). It also has an accompanying function
that does basic validity checks.


 On Sat, Nov 17, 2012 at 5:20 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
 On Sat, Nov 17, 2012 at 4:15 PM, Sage Weil s...@inktank.com wrote:

 We ignore that for the purposes of getting the libcephfs API correct,
 though...

 Ok, make sense. Thanks.

 Noah

FYI, there's an unused __le32 in the open struct (used to be for
preferred PG). We should be able to steal that away without too much
pain or massaging! :)
-Greg


Re: Can't start ceph mon

2012-11-19 Thread Gregory Farnum
Also, if you still have it, could you zip up your monitor data
directory and put it somewhere accessible to us? (I can provide you a
drop point if necessary.) We'd like to look at the file layouts a bit
since we thought we were properly handling ENOSPC-style issues.
-Greg

On Mon, Nov 19, 2012 at 1:45 PM, Gregory Farnum g...@inktank.com wrote:
 On Mon, Nov 19, 2012 at 1:08 PM, Dave Humphreys (Datatone)
 d...@datatone.co.uk wrote:

 I have a problem in which I can't start my ceph monitor. The log is shown 
 below.

 The log shows version 0.54. I was running 0.52 when the problem arose, and I 
 moved to the latest in case the newer version fixed the problem.

 The original failure happened a week or so ago, and could have been as a 
 result of running out of disk space when the ceph monitor log became huge.

 That is almost certainly the case, although I thought we were handling
 this issue better now.

 What should I do to recover the situation?

 Do you have other monitors in working order? The easiest way to handle
 it if that's the case is just to remove this monitor from the cluster
 and add it back in as a new monitor with a fresh store. If not we can
 look into reconstructing it.
 -Greg



 David





 2012-11-19 20:38:51.598468 7fc13fdc6780  0 ceph version 0.54 
 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 
 21012
 2012-11-19 20:38:51.598482 7fc13fdc6780  1 store(/ceph/mon.vault01) mount
 2012-11-19 20:38:51.598527 7fc13fdc6780 20 store(/ceph/mon.vault01) reading 
 at off 0 of 21
 2012-11-19 20:38:51.598542 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
 magic = 21 bytes
 2012-11-19 20:38:51.598562 7fc13fdc6780 20 store(/ceph/mon.vault01) reading 
 at off 0 of 75
 2012-11-19 20:38:51.598567 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
 feature_set = 75 bytes
 2012-11-19 20:38:51.598582 7fc13fdc6780 20 store(/ceph/mon.vault01) reading 
 at off 0 of 205
 2012-11-19 20:38:51.598586 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
 monmap/latest = 205 bytes
 2012-11-19 20:38:51.598809 7fc13fdc6780  1 -- 10.0.1.1:6789/0 learned my 
 addr 10.0.1.1:6789/0
 2012-11-19 20:38:51.598818 7fc13fdc6780  1 accepter.accepter.bind 
 my_inst.addr is 10.0.1.1:6789/0 need_addr=0
 2012-11-19 20:38:51.599498 7fc13fdc6780  1 -- 10.0.1.1:6789/0 messenger.start
 2012-11-19 20:38:51.599508 7fc13fdc6780  1 accepter.accepter.start
 2012-11-19 20:38:51.599610 7fc13fdc6780  1 mon.vault01@-1(probing) e1 init 
 fsid 4d7d8d20-338c-4bdc-9918-9bcf04f9a832
 2012-11-19 20:38:51.599674 7fc13cdbe700  1 -- 10.0.1.1:6789/0  :/0 
 pipe(0x213c6c0 sd=14 :6789 pgs=0 cs=0 l=0).accept sd=14
 2012-11-19 20:38:51.599678 7fc141eff700  1 -- 10.0.1.1:6789/0  :/0 
 pipe(0x213c240 sd=9 :6789 pgs=0 cs=0 l=0).accept sd=9
 2012-11-19 20:38:51.599718 7fc13fdc6780 20 store(/ceph/mon.vault01) reading 
 at off 0 of 37
 2012-11-19 20:38:51.599723 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
 cluster_uuid = 37 bytes
 2012-11-19 20:38:51.599718 7fc13ccbd700  1 -- 10.0.1.1:6789/0  :/0 
 pipe(0x213c480 sd=19 :6789 pgs=0 cs=0 l=0).accept sd=19
 2012-11-19 20:38:51.599729 7fc13fdc6780 10 mon.vault01@-1(probing) e1 
 check_fsid cluster_uuid contains '4d7d8d20-338c-4bdc-9918-9bcf04f9a832'
 2012-11-19 20:38:51.599739 7fc13fdc6780 20 store(/ceph/mon.vault01) reading 
 at off 0 of 75
 2012-11-19 20:38:51.599745 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
 feature_set = 75 bytes
 2012-11-19 20:38:51.599751 7fc13fdc6780 10 mon.vault01@-1(probing) e1 
 features compat={},rocompat={},incompat={1=initial feature set (~v.18)}
 2012-11-19 20:38:51.599759 7fc13fdc6780 15 store(/ceph/mon.vault01) 
 exists_bl joined
 2012-11-19 20:38:51.599769 7fc13fdc6780 10 mon.vault01@-1(probing) e1 
 has_ever_joined = 1
 2012-11-19 20:38:51.599794 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int 
 pgmap/last_committed = 13
 2012-11-19 20:38:51.599801 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int 
 pgmap/first_committed = 132833
 2012-11-19 20:38:51.599810 7fc13fdc6780 20 store(/ceph/mon.vault01) reading 
 at off 0 of 239840
 2012-11-19 20:38:51.599928 7fc13cbbc700  1 -- 10.0.1.1:6789/0  :/0 
 pipe(0x213cd80 sd=20 :6789 pgs=0 cs=0 l=0).accept sd=20
 2012-11-19 20:38:51.599950 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl 
 pgmap/latest = 239840 bytes
 --- begin dump of recent events ---
 2012-11-19 20:38:51.600509 7fc13fdc6780 -1 *** Caught signal (Aborted) **
  in thread 7fc13fdc6780

  ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
  1: ceph-mon() [0x53adf8]
  2: (()+0xfe90) [0x7fc141830e90]
  3: (gsignal()+0x3e) [0x7fc140016dae]
  4: (abort()+0x17b) [0x7fc14001825b]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d]
  6: (()+0xb31b6) [0x7fc141af11b6]
  7: (()+0xb31e3) [0x7fc141af11e3]
  8: (()+0xb32de) [0x7fc141af12de]
  9: ceph-mon() [0x5ecb9f]
  10: (Paxos::get_stashed(ceph::buffer::list)+0x1ed) [0x49e28d]
  11: (Paxos::init()+0x109) [0x49e609]
  12: (Monitor::init()+0x36a) [0x485a4a]
  13: 

Request to join mailing group

2012-11-19 Thread Pat Beadles




Re: libcephfs create file with layout and replication

2012-11-19 Thread Noah Watkins
On Mon, Nov 19, 2012 at 5:04 PM, Gregory Farnum g...@inktank.com wrote:

 Just glanced over this, and I'm curious:
 1) Why symlink another reference to your file_layout.h?

I followed the same pattern as page.h in librados, but may have
misunderstood its use. When libcephfs.h is installed, it includes

  #include "file_layout.h"

and we assume the user has -Iprefix/cephfs/.

but in the build tree, include/cephfs isn't on the include path,
hence the symlink.

 2) There's already a ceph_file_layout struct which is used widely
 (MDS, kernel, userspace client). It also has an accompanying function
 that does basic validity checks.

I avoided ceph_file_layout because I was under the impression that all
of the __le64 stuff in it was very much Linux-specific. I had run into
a lot of this hacking on an OSX port.

 FYI, there's an unused __le32 in the open struct (used to be for
 preferred PG). We should be able to steal that away without too much
 pain or massaging! :)

Nice. Do you think I should revert back to using ceph_file_layout?

Thanks,
Noah


Re: libcephfs create file with layout and replication

2012-11-19 Thread Sage Weil
On Mon, 19 Nov 2012, Noah Watkins wrote:
 On Mon, Nov 19, 2012 at 5:04 PM, Gregory Farnum g...@inktank.com wrote:
 
  Just glanced over this, and I'm curious:
  1) Why symlink another reference to your file_layout.h?
 
 I followed the same pattern as page.h in librados, but may have
 misunderstood its use. When libcephfs.h is installed, it includes
 
    #include "file_layout.h"
 
 and we assume the user has -Iprefix/cephfs/.
 
 but in the build tree, include/cephfs isn't an includes path used,
 hence the symlink.
 
  2) There's already a ceph_file_layout struct which is used widely
  (MDS, kernel, userspace client). It also has an accompanying function
  that does basic validity checks.
 
 I avoided ceph_file_layout because I was under the impression that all
 of the __le64 stuff in it was very much Linux-specific. I had run into
 a lot of this hacking on an OSX port.
 
  FYI, there's an unused __le32 in the open struct (used to be for
  preferred PG). We should be able to steal that away without too much
  pain or massaging! :)
 
 Nice. Do you think I should revert back to using ceph_file_layout?

We could avoid the whole issue by passing 4 arguments to the function...


Re: Remote Ceph Install

2012-11-19 Thread Dan Mick



On 11/19/2012 11:42 AM, Blackwell, Edward wrote:

Hi,
I work for Harris Corporation, and we are investigating Ceph as a potential 
solution to a storage problem that one of our government customers is currently 
having.  I've already created a two-node cluster on a couple of VMs with 
another VM acting as an administrative client.  The cluster was created using 
some installation instructions supplied to us via Inktank, and through the use 
of the ceph-deploy script.  Aside from a couple of quirky discrepancies between 
the installation instructions and my environment, everything went well.  My 
issue has cropped up on the second cluster I'm trying to create, which is using 
a VM and a non-VM server for the nodes in the cluster.  Eventually, both nodes 
in this cluster will be non-VMs, but we're still waiting on the hardware for 
the second node, so I'm using a VM in the meantime just to get this second 
cluster up and going.  Of course, the administrative client node is still a VM.


Hi Ed.  Welcome.


The problem that I'm having with this second cluster concerns the non-VM server 
(elsceph01 for the sake of the commands mentioned from here on out).  In 
particular, the issue crops up with the ceph-deploy install elsceph01 command 
I'm executing on my client VM (cephclient01) to install Ceph on the non-VM 
server. The installation doesn't appear to be working as the command does not 
return the OK message that it should when it completes successfully.  I've 
tried using the verbose option on the command to see if that sheds any light on 
the subject, but alas, it does not:


root@cephclient01:~/my-admin-sandbox# ceph-deploy -v install elsceph01
DEBUG:ceph_deploy.install:Installing stable version argonaut on cluster ceph 
hosts elsceph01
DEBUG:ceph_deploy.install:Detecting platform for host elsceph01 ...
DEBUG:ceph_deploy.install:Installing for Ubuntu 12.04 on host elsceph01 ...
root@cephclient01:~/my-admin-sandbox#


Would you happen to have a breakdown of the commands being executed by the 
ceph-deploy script behind the scenes so I can maybe execute them one-by-one to 
see where the error is?  I have confirmed that it looks like the installation 
of the software has succeeded as I did a 'which ceph' command on elsceph01, and 
it reported back /usr/bin/ceph.  Also, /etc/ceph/ceph.conf is there, and it 
matches the file created by the ceph-deploy new ... command on the client.  
Does the install command do a mkcephfs behind the scenes?  The reason I ask is 
that when I do the ceph-deploy mon command from the client, which is the next 
command listed in the instructions to do, I get this output:


Basically install just runs the appropriate debian package commands to 
get the requested release of Ceph installed on the target host (in this 
case, defaulting to argonaut).  The command normally doesn't issue any 
output.



root@cephclient01:~/my-admin-sandbox# ceph-deploy mon
creating /var/lib/ceph/tmp/ceph-ELSCEPH01.mon.keyring


This looks like there may be confusion about case in the hostname.  What 
does hostname on elsceph01 report?  If it's ELSCEPH01, that's probably 
the problem; the pathnames etc. are all case-sensitive.
Could be that /etc/hosts has the wrong case, or both cases, of the 
hostname in it?



2012-11-15 11:35:38.954261 7f7a6c274780 -1 asok(0x260b000) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
bind the UNIX domain socket to '/var/run/ceph/ceph-mon.ELSCEPH01.asok': (2) No 
such file or directory
Traceback (most recent call last):
   File /usr/local/bin/ceph-deploy, line 9, in module
 load_entry_point('ceph-deploy==0.0.1', 'console_scripts', 'ceph-deploy')()
   File /root/ceph-deploy/ceph_deploy/cli.py, line 80, in main
added entity mon. auth auth(auid = 18446744073709551615 
key=AQBWDj5QAP6LHhAAskVBnUkYHJ7eYREmKo5qKA== with 0 caps)
 return args.func(args)
mon/MonMap.h: In function 'void MonMap::add(const string, const 
entity_addr_t)' thread 7f7a6c274780 time 2012-11-15 11:35:38.955024
mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0)
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8]
2: (MonMap::build_initial(CephContext*, std::ostream)+0x113) [0x59bd53]
3: (main()+0x12bb) [0x45ffab]
4: (__libc_start_main()+0xed) [0x7f7a6a6d776d]
5: ceph-mon() [0x462a19]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
2012-11-15 11:35:38.955924 7f7a6c274780 -1 mon/MonMap.h: In function 'void 
MonMap::add(const string, const entity_addr_t)' thread 7f7a6c274780 time 
2012-11-15 11:35:38.955024
mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0)

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8]
2: (MonMap::build_initial(CephContext*, std::ostream)+0x113) [0x59bd53]
3: (main()+0x12bb) [0x45ffab]
4: 

Re: RBD fio Performance concerns

2012-11-19 Thread Alexandre DERUMIER
Which iodepth did you use for those benchmarks? 

iodepth = 100

filesize = 1G, 10G, 30G, same result

(3 nodes, 8 cores 2.5GHz, 32GB RAM, with 6 OSDs each (15k drive) + journal on 
tmpfs)


Note that I can't get more than 6000 iops on a single rbd device, but it 
scales with more devices (each fio job reaches 6000 iops).

(I have same result with rbd module or with kvm guest)



- Original Message - 

From: Sébastien Han han.sebast...@gmail.com 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel ceph-devel@vger.kernel.org, Mark Kampe 
mark.ka...@inktank.com 
Sent: Monday, 19 November 2012 21:57:59 
Subject: Re: RBD fio Performance concerns 

Which iodepth did you use for those benchmarks? 


 I really don't understand why I can't get more rand read iops with 4K block 
 ... 

Me neither, hope to get some clarification from the Inktank guys. It 
doesn't make any sense to me... 
-- 
Bien cordialement. 
Sébastien HAN. 


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER aderum...@odiso.com 
wrote: 
@Alexandre: is it the same for you? or do you always get more IOPS with seq? 
 
  rand read 4K : 6000 iops 
  seq read 4K : 3500 iops 
  seq read 4M : 31 iops (1 gigabit client bandwidth limit) 
  
  rand write 4K: 6000 iops (tmpfs journal) 
  seq write 4K: 1600 iops 
  seq write 4M : 31 iops (1 gigabit client bandwidth limit) 
 
 
 I really don't understand why I can't get more rand read iops with 4K block 
 ... 
 
  I tried with a high-end CPU for the client; it doesn't change anything. 
  But the test cluster uses old 8-core E5420 @ 2.50GHz (though CPU is around 15% on 
  the cluster during the read bench) 
 
 
  - Original Message - 
  
  From: Sébastien Han han.sebast...@gmail.com 
  To: Mark Kampe mark.ka...@inktank.com 
  Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel 
  ceph-devel@vger.kernel.org 
  Sent: Monday, 19 November 2012 19:03:40 
  Subject: Re: RBD fio Performance concerns 
 
 @Sage, thanks for the info :) 
 @Mark: 
 
 If you want to do sequential I/O, you should do it buffered 
 (so that the writes can be aggregated) or with a 4M block size 
 (very efficient and avoiding object serialization). 
 
 The original benchmark has been performed with 4M block size. And as 
 you can see I still get more IOPS with rand than seq... I just tried 
 with 4M without direct I/O, still the same. I can print fio results if 
 it's needed. 
 
 We do direct writes for benchmarking, not because it is a reasonable 
 way to do I/O, but because it bypasses the buffer cache and enables 
 us to directly measure cluster I/O throughput (which is what we are 
 trying to optimize). Applications should usually do buffered I/O, 
 to get the (very significant) benefits of caching and write aggregation. 
 
 I know why I use direct I/O. It's a synthetic benchmark, far away 
 from real-life scenarios and how common applications work. I just 
 try to see the maximum I/O throughput that I can get from my RBD. All 
 my applications use buffered I/O. 
 
 @Alexandre: is it the same for you? or do you always get more IOPS with seq? 
 
 Thanks to all of you.. 
 
 
 On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com wrote: 
 Recall: 
 1. RBD volumes are striped (4M wide) across RADOS objects 
 2. distinct writes to a single RADOS object are serialized 
 
 Your sequential 4K writes are direct, depth=256, so there are 
 (at all times) 256 writes queued to the same object. All of 
 your writes are waiting through a very long line, which is adding 
 horrendous latency. 
 
 If you want to do sequential I/O, you should do it buffered 
 (so that the writes can be aggregated) or with a 4M block size 
 (very efficient and avoiding object serialization). 
 
 We do direct writes for benchmarking, not because it is a reasonable 
 way to do I/O, but because it bypasses the buffer cache and enables 
 us to directly measure cluster I/O throughput (which is what we are 
 trying to optimize). Applications should usually do buffered I/O, 
 to get the (very significant) benefits of caching and write aggregation. 
 
 
 That's correct for some of the benchmarks. However even with 4K for 
 seq, I still get less IOPS. See below my last fio: 
 
 # fio rbd-bench.fio 
 seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 
 rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, 
 iodepth=256 
 seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 
 rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, 
 iodepth=256 
 fio 1.59 
 Starting 4 processes 
 Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 
 02m:59s] 
 seq-read: (groupid=0, jobs=1): err= 0: pid=15096 
 read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec 
 slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90 
 clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63 
 lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62 
 bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, 
 stdev=6239.06 
 cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279