Re: [BUG] rbd discard should return OK even if rbd file does not exist
Hi Josh,

I got the following info from the qemu devs: the discards get canceled by the client kernel because they take TOO long. This happens because Ceph handles discards as buffered I/O. I see that there are at most 800 pending requests, and rbd returns success only once no requests are left. That is too long for the kernel. I think discards must be changed to unbuffered I/O to solve this.

Greets,
Stefan

Am 18.11.2012 03:38, schrieb Josh Durgin:
> On 11/17/2012 02:19 PM, Stefan Priebe wrote:
>> Hello list,
>>
>> right now librbd returns an error if I issue a discard for a sector /
>> byte range where Ceph does not have any file, because I had never
>> written to this section. This is not correct. It should return 0 / OK
>> in this case.
>
> Thanks for bringing this up again. I haven't had time to dig deeper
> into it yet, but I definitely want to fix this for bobtail.
>
>> Stefan
>>
>> Example log:
>>
>> 2012-11-02 21:06:17.649922 7f745f7fe700 20 librbd::AioRequest: WRITE_FLAT
>> 2012-11-02 21:06:17.649924 7f745f7fe700 20 librbd::AioCompletion: AioCompletion::complete_request() this=0x7f72cc05bd20 complete_cb=0x7f747021d4b0
>> 2012-11-02 21:06:17.649924 7f747015c780 1 -- 10.10.0.2:0/2028325 --> 10.10.0.18:6803/9687 -- osd_op(client.26862.0:3073 rb.0.1044.359ed6c7.0bde [delete] 3.bd84636 snapc 2=[]) v4 -- ?+0 0x7f72d81c69b0 con 0x7f74600dbf50
>> 2012-11-02 21:06:17.649934 7f747015c780 20 librbd: oid rb.0.1044.359ed6c7.0bdf 0~4194304 from [4156556288,4194304]
>> 2012-11-02 21:06:17.649972 7f7465a6e700 1 -- 10.10.0.2:0/2028325 <== osd.1202 10.10.0.18:6806/9821 143 osd_op_reply(1652 rb.0.1044.359ed6c7.0652 [delete] ondisk = -2 (No such file or directory)) v4 130+0+0 (2964367729 0 0) 0x7f72dc0f0090 con 0x7f74600e4350
>> 2012-11-02 21:06:17.649994 7f745f7fe700 20 librbd::AioRequest: write 0x7f74600feab0 should_complete: r = -2
>
> This last line isn't printing what's actually being returned to the
> application. It's still in librbd's internal processing, and will be
> converted to 0 for the application.
> Could you try with the master or next branches? After the
> 'should_complete' line, there should be a line like:
>
>     date time thread_id 20 librbd::AioCompletion: AioCompletion::finalize() rval 0 ...
>
> That 'rval 0' shows the actual return value the application (qemu in
> this case) will see.
>
> Josh

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
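The conversion Josh describes can be sketched in a few lines. This is an illustration only, not librbd's actual code: a hypothetical helper that folds the OSD's "No such file or directory" reply into success, since discarding a never-written range already has the outcome the caller asked for.

```python
import errno

# Sketch only (not librbd's real implementation): fold the OSD's
# -ENOENT reply to a delete into success at the librbd layer.
def discard_result(osd_return_code):
    if osd_return_code == -errno.ENOENT:
        return 0  # nothing to delete: the discard still succeeded
    return osd_return_code

print(discard_result(-errno.ENOENT))  # 0 -- what the application sees
print(discard_result(-errno.EIO))     # real errors still propagate
```

This matches the log above: the internal `r = -2` is what the OSD returned, while `rval 0` is what reaches qemu.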
Re: [BUG] rbd discard should return OK even if rbd file does not exist
Sorry, I meant the building in this case. The building of 900 requests takes too long, so the kernel starts to cancel these I/O requests.

    void AioCompletion::finish_adding_requests(CephContext *cct)
    {
      ldout(cct, 20) << "AioCompletion::finish_adding_requests "
                     << (void*)this << " pending " << pending_count << dendl;
      lock.Lock();
      assert(building);
      building = false;
      if (!pending_count) {
        finalize(cct, rval);
        complete();
      }
      lock.Unlock();
    }

finalize() and complete() are only done when pending_count is 0, i.e. when all I/O is done.

Stefan

Am 19.11.2012 09:38, schrieb Stefan Priebe - Profihost AG:
> Hi Josh,
>
> I got the following info from the qemu devs: the discards get canceled
> by the client kernel because they take TOO long. This happens because
> Ceph handles discards as buffered I/O. [...]
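The completion logic Stefan quotes can be modeled in miniature. This is an assumption-laden sketch, not librbd's real implementation: the point is that the aggregate completion fires only once building has finished AND every sub-request has completed, which is why a discard spanning hundreds of objects signals success so late.

```python
import threading

# Toy model of the finish_adding_requests() behaviour described above.
class AioCompletion:
    def __init__(self):
        self.lock = threading.Lock()
        self.building = True   # still queuing sub-requests
        self.pending = 0       # sub-requests in flight
        self.done = False      # visible to the application

    def add_request(self):
        with self.lock:
            self.pending += 1

    def complete_request(self):
        # called as each sub-request (e.g. one per RADOS object) finishes
        with self.lock:
            self.pending -= 1
            if not self.building and self.pending == 0:
                self.done = True

    def finish_adding_requests(self):
        with self.lock:
            assert self.building
            self.building = False
            if self.pending == 0:
                self.done = True

c = AioCompletion()
for _ in range(3):
    c.add_request()
c.finish_adding_requests()
print(c.done)   # False: sub-requests still pending
for _ in range(3):
    c.complete_request()
print(c.done)   # True: only now does the caller see success
```

With 800+ sub-requests instead of 3, the gap between issuing the discard and `done` becoming true is exactly the window in which the guest kernel gives up.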
Re: [BUG] rbd discard should return OK even if rbd file does not exist
Hi Josh,

sorry for the bunch of mails. It turns out not to be a bug in RBD or Ceph, but a bug in the Linux kernel itself. Paolo from qemu told me the Linux kernel should serialize these requests instead of sending the whole bunch at once and then hoping that all of them get handled in milliseconds.

Stefan

Am 18.11.2012 03:38, schrieb Josh Durgin:
> On 11/17/2012 02:19 PM, Stefan Priebe wrote:
>> Hello list,
>>
>> right now librbd returns an error if I issue a discard for a sector /
>> byte range where Ceph does not have any file, because I had never
>> written to this section. [...]
>
> Thanks for bringing this up again. I haven't had time to dig deeper
> into it yet, but I definitely want to fix this for bobtail. [...]
Re: [BUG] rbd discard should return OK even if rbd file does not exist
Strangely enough, this works fine with a normal iSCSI target... no idea why.

Stefan

Am 19.11.2012 11:15, schrieb Stefan Priebe - Profihost AG:
> Hi Josh,
>
> sorry for the bunch of mails. It turns out not to be a bug in RBD or
> Ceph, but a bug in the Linux kernel itself. Paolo from qemu told me the
> Linux kernel should serialize these requests instead of sending the
> whole bunch at once and then hoping that all of them get handled in
> milliseconds. [...]
Re: rbd tool changed format? (breaks compatibility)
On 11/16/2012 07:14 PM, Josh Durgin wrote:
> On 11/16/2012 06:36 AM, Constantinos Venetsanopoulos wrote:
>> Hello ceph team,
>>
>> As you may already know, our team in GRNET is building a complete open
>> source cloud platform called Synnefo [1], which already powers our
>> production public cloud service ~okeanos [2]. Synnefo uses Google
>> Ganeti for the low-level VM management part [3]. As of Jan 2012, we
>> have merged support for VM disks on RADOS into upstream Ganeti [4].
>>
>> Today we received feedback that other people trying to run Ganeti with
>> RADOS get an error, probably because the output of the 'rbd showmapped'
>> command has changed. I'd like to ask if the output format of the rbd
>> tool has indeed changed. More specifically:
>>
>> 1. Does the 'rbd showmapped' command still return just the headers if
>> no device is mapped?
>
> No

Ack.

>> 2. Has the separator between the 'rbd showmapped' columns changed
>> from \t?
>
> Yes, this is in the release notes for 0.54
> (http://ceph.com/docs/master/release-notes/#v0-54).

Ack.

>> I don't have the latest rbd tool set up (but rather
>> ceph-common=0.48.1argonaut-1~bpo60+1), so I can't test it right now,
>> but I see this commit:
>> https://github.com/ceph/ceph/commit/bed55369a96c2651f513b8c9b1a7bb92fb87550a
>
> Yeah, that's the commit that changed it.
>
>> How stable can we consider the rbd tool's output format? This is
>> something we want to run in a production environment. Using the tool
>> rather than the library makes things much simpler.
>
> Generally it won't change much, but I don't think it should be
> considered entirely unchanging. We'll add it to the release notes when
> the output does change. We'll probably switch other commands to use
> TextTable too, with the same results as with showmapped and lock list.
> We could send a message to the mailing list when the output changes as
> well, so you can prepare for a future release.

That would be great, and highly appreciated.
Please drop us an email at the following mailing lists when the rbd tool's format changes:

synnefo-de...@googlegroups.com
ganeti-de...@googlegroups.com

> Perhaps we should add a --format json|plain option so you don't have
> to rely on particular formatting, you just parse the json. This would
> match existing usage by many 'ceph ...' commands, and be easier for
> scripts to use in general.

That would be even better! That would be the best approach for us, since we use it inside Python code. Parsing JSON is very simple, and we will be able to maintain compatibility even when the format changes.

Thanks,
Constantinos
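To illustrate why the proposed --format json option would solve the compatibility problem: the consumer only depends on field names, not column layout. The JSON shape and field names below are assumptions for illustration; the option was only proposed at this point, so the real output may differ.

```python
import json

# Hypothetical JSON output of a future "rbd showmapped --format json";
# field names here are illustrative assumptions, not a documented format.
sample = '''
[
  {"id": "0", "pool": "rbd", "name": "vm-disk-1", "snap": "-", "device": "/dev/rbd0"},
  {"id": "1", "pool": "rbd", "name": "vm-disk-2", "snap": "-", "device": "/dev/rbd1"}
]
'''

def parse_showmapped(raw):
    """Return {image name: device path}; immune to separator or
    column-order changes, unlike splitting on \\t or whitespace."""
    return {entry["name"]: entry["device"] for entry in json.loads(raw)}

print(parse_showmapped(sample)["vm-disk-1"])  # /dev/rbd0
```

An empty mapping list would simply parse to `[]`, sidestepping the "headers only vs. nothing" question as well.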
Re: RBD fio Performance concerns
Hello Mark,

First of all, thank you again for another accurate answer :-).

> I would have expected write aggregation and cylinder affinity to have
> eliminated some seeks and improved rotational latency resulting in
> better than theoretical random write throughput. Against those
> expectations 763/850 IOPS is not so impressive.
>
> But, it looks to me like you were running fio in a 1G file with 100
> parallel requests. The default RBD stripe width is 4M. This means that
> those 100 parallel requests were being spread across 256 (1G/4M)
> objects. People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized. This would
> increase the average request time for the colliding operations, and
> reduce the aggregate throughput correspondingly. Use a bigger file (or
> a narrower stripe) and this will get better.

I followed your advice and used a bigger file (10G) and an iodepth of 128, and I've been able to reach ~27k IOPS for random reads, but I couldn't get more than 870 IOPS for random writes... which is kind of expected. But the thing I still don't understand is: why are the sequential reads/writes lower than the random ones? Or do I just need to care about the bandwidth for those values?

Thank you. Regards.

--
Bien cordialement.
Sébastien HAN.

On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe <mark.ka...@inktank.com> wrote:
> On 11/15/2012 12:23 PM, Sébastien Han wrote:
>> First of all, I would like to thank you for this well explained,
>> structured and clear answer. I guess I got better IOPS thanks to the
>> 10K disks.
>
> 10K RPM would bring your per-drive throughput (for 4K random writes) up
> to 142 IOPS and your aggregate cluster throughput up to 1700. This
> would predict a corresponding RADOSbench throughput somewhere above 425
> (how much better depending on write aggregation and cylinder affinity).
> Your RADOSbench 708 now seems even more reasonable.
>> To be really honest, I wasn't so concerned about the RADOS benchmarks
>> but more about the RBD fio benchmarks and the amount of IOPS that
>> comes out of them, which I found a bit too low.
>
> Sticking with 4K random writes, it looks to me like you were running
> fio with libaio (which means direct, no buffer cache). Because it is
> direct, every I/O operation is really happening, and the best sustained
> throughput you should expect from this cluster is the aggregate raw fio
> 4K write throughput (1700 IOPS) divided by two copies = 850 random 4K
> writes per second. If I read the output correctly you got 763, or about
> 90% of back-of-envelope.
>
> BUT, there are some footnotes (there always are with performance).
>
> If you had been doing buffered I/O you would have seen a lot more (up
> front) benefit from page caching ... but you wouldn't have been
> measuring real (and hence sustainable) I/O throughput ... which is
> ultimately limited by the heads on those twelve disk drives, where all
> of those writes ultimately wind up. It is easy to be fast if you aren't
> really doing the writes :-)
>
> I would have expected write aggregation and cylinder affinity to have
> eliminated some seeks and improved rotational latency resulting in
> better than theoretical random write throughput. Against those
> expectations 763/850 IOPS is not so impressive. But, it looks to me
> like you were running fio in a 1G file with 100 parallel requests. The
> default RBD stripe width is 4M. This means that those 100 parallel
> requests were being spread across 256 (1G/4M) objects. People in the
> know tell me that writes to a single object are serialized, which means
> that many of those (potentially) parallel writes were to the same
> object, and hence serialized. This would increase the average request
> time for the colliding operations, and reduce the aggregate throughput
> correspondingly. Use a bigger file (or a narrower stripe) and this will
> get better.
>
> Thus, getting 763 random 4K write IOPs out of those 12 drives still
> sounds about right to me.
>
> On 15 nov. 2012, at 19:43, Mark Kampe <mark.ka...@inktank.com> wrote:
>> Dear Sebastien,
>>
>> Ross Turk forwarded me your e-mail. You sent a great deal of
>> information, but it was not immediately obvious to me what your
>> specific concern was. You have 4 servers, 3 OSDs per server, 2 copies,
>> and you measured a radosbench (4K object creation) throughput of
>> 2.9MB/s (or 708 IOPS). I infer that you were disappointed by this
>> number, but it looks right to me.
>>
>> Assuming typical 7200 RPM drives, I would guess that each of them
>> would deliver a sustained direct 4K random write performance in the
>> general neighborhood of:
>>
>>     4ms seek (short seeks with write-settle-downs)
>>     4ms latency (1/2 rotation)
>>     0ms write (4K / 144MB/s ~ 30us)
>>     ----
>>     8ms, or about 125 IOPS
>>
>> Your twelve drives should therefore have a sustainable aggregate
>> direct 4K random write throughput of 1500 IOPS. Each 4K object create
>> involves four writes (two copies, each getting
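Mark's back-of-envelope arithmetic can be written down as a small sketch. The seek time and transfer-rate figures are his assumptions from the thread, not measured values:

```python
def drive_4k_write_iops(seek_ms, rpm):
    """Back-of-envelope 4K random-write IOPS for one spinning drive:
    average seek plus half a rotation; the 4K transfer itself (~30us
    at 144MB/s) is negligible."""
    half_rotation_ms = 60_000.0 / rpm / 2.0
    return 1000.0 / (seek_ms + half_rotation_ms)

# 7200 RPM: 4ms seek + ~4.2ms latency -> ~122 IOPS (Mark rounds to 125)
# 10K RPM:  4ms seek + 3ms latency    -> ~143 IOPS (Mark quotes 142)
per_drive = drive_4k_write_iops(seek_ms=4.0, rpm=10_000)
cluster_raw = 12 * per_drive          # 12 OSD drives -> ~1700 IOPS aggregate
client_visible = cluster_raw / 2      # 2 replicas: every write lands twice
print(round(per_drive), round(cluster_raw), round(client_visible))
# 143 1714 857
```

Dividing by two for replication gives the ~850 client-visible 4K write IOPS ceiling against which the measured 763 is about 90%.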
Re: RBD fio Performance concerns
> If I remember, you use fio with 4MB block size for sequential. So it's
> normal that you have fewer IOs, but more bandwidth.

That's correct for some of the benchmarks. However, even with 4K for sequential, I still get fewer IOPS. See my last fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338, runt= 60053msec
    slat (usec): min=8, max=45921, avg=296.69, stdev=1584.90
    clat (msec): min=18, max=133, avg=76.37, stdev=16.63
    lat (msec): min=18, max=133, avg=76.67, stdev=16.62
    bw (KB/s): min=0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=200473/0/0, short=0/0/0
     lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
  read : io=6376.4MB, bw=108814KB/s, iops=27203, runt= 60005msec
    slat (usec): min=8, max=12723, avg=33.54, stdev=59.87
    clat (usec): min=4642, max=55760, avg=9374.10, stdev=970.40
    lat (usec): min=4671, max=55788, avg=9408.00, stdev=971.21
    bw (KB/s): min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
  cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=1632349/0/0, short=0/0/0
     lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
  write: io=44684KB, bw=753502 B/s, iops=183, runt= 60725msec
    slat (usec): min=8, max=1246.8K, avg=5402.76, stdev=40024.97
    clat (msec): min=25, max=4868, avg=1384.22, stdev=470.19
    lat (msec): min=25, max=4868, avg=1389.62, stdev=470.17
    bw (KB/s): min=7, max=2165, per=104.03%, avg=764.65, stdev=353.97
  cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/11171/0, short=0/0/0
     lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
     lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
  write: io=208588KB, bw=3429.5KB/s, iops=857, runt= 60822msec
    slat (usec): min=10, max=1693.9K, avg=1148.15, stdev=15210.37
    clat (msec): min=22, max=5639, avg=297.37, stdev=430.27
    lat (msec): min=22, max=5639, avg=298.52, stdev=430.84
    bw (KB/s): min=0, max=7728, per=31.44%, avg=1078.21, stdev=2000.45
  cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/52147/0, short=0/0/0
     lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
     lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
   READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec
Run status group 1 (all jobs):
   READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec
Run status group 2 (all jobs):
  WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec
Run status group 3 (all jobs):
  WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec

Disk stats (read/write):
  rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%

Cheers!

--
Bien cordialement.
Sébastien HAN.

On Mon, Nov 19, 2012 at 4:28 PM, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>> why the sequential read/writes are lower than the randoms onces? Or
>> maybe do I just need to care about the bandwidth for those values?
>
> If I remember, you use fio with 4MB block size for sequential. So it's
> normal that you have less ios, but more
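As for whether to look at IOPS or bandwidth: for a fixed block size the two carry the same information, since bandwidth is just IOPS times block size. A quick check against the fio run above (the small residual differences are fio's internal rounding):

```python
def bw_kb_s(iops, block_kb=4):
    """For fixed-size requests, bandwidth = IOPS x block size,
    so the two numbers are interchangeable views of the same rate."""
    return iops * block_kb

print(bw_kb_s(3338))   # 13352 -- cf. seq-read  iops=3338,  bw=13353KB/s
print(bw_kb_s(27203))  # 108812 -- cf. rand-read iops=27203, bw=108814KB/s
```

Comparing across different block sizes (4K vs 4M jobs), bandwidth is the meaningful number; at equal block size, either works.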
Re: RBD fio Performance concerns
On Mon, 19 Nov 2012, Sébastien Han wrote:
>> If I remember, you use fio with 4MB block size for sequential. So it's
>> normal that you have fewer IOs, but more bandwidth.
>
> That's correct for some of the benchmarks. However, even with 4K for
> sequential, I still get fewer IOPS. See my last fio run below:

Small IOs striped over large objects tend to mean that many IOs may get piled up behind a single object at a time. There is a new striping feature in RBD that lets you stripe small blocks over larger objects to mitigate this, but it means slower performance the rest of the time, and it is only really useful for specific workloads (e.g., a database journal file/device).

sage

> # fio rbd-bench.fio
> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> [...]
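The striping Sage mentions can be visualized with a simplified model of the layout (an illustrative sketch of round-robin striping across an object set, not Ceph's actual code; parameter names are assumptions):

```python
def object_for_offset(offset, object_size=4 << 20, stripe_unit=4 << 20,
                      stripe_count=1):
    """Simplified striping model: stripe_unit-sized blocks laid out
    round-robin across stripe_count objects per object set. The
    defaults model the classic layout: one 4M stripe unit per 4M object."""
    set_bytes = object_size * stripe_count
    object_set = offset // set_bytes
    stripe_no = (offset % set_bytes) // stripe_unit
    return object_set * stripe_count + stripe_no % stripe_count

# Classic layout: 256 sequential 4K writes inside one 4M region all hit
# the same object, so they serialize behind each other.
classic = {object_for_offset(i * 4096) for i in range(256)}
# With (hypothetical) 64K stripe units across 16 objects, the same
# writes fan out across 16 objects instead.
striped = {object_for_offset(i * 4096, stripe_unit=64 << 10,
                             stripe_count=16) for i in range(256)}
print(len(classic), len(striped))  # 1 16
```

The trade-off Sage notes follows directly: the fan-out helps small sequential writes, but it also means large sequential IOs touch many objects instead of one.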
Re: RBD fio Performance concerns
Recall:

1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are (at all times) 256 writes queued to the same object. All of your writes are waiting through a very long line, which is adding horrendous latency.

If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation.

> That's correct for some of the benchmarks. However, even with 4K for
> sequential, I still get fewer IOPS. See my last fio run below:
>
> # fio rbd-bench.fio
> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> [...]
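Mark's "very long line" can be quantified with Little's law: with a fixed queue depth against one serialized object, per-request latency is queue depth divided by the object's sustained IOPS. Plugging in the numbers from the fio run (the per-object service rate is taken from the observed seq-write group, an approximation):

```python
def avg_latency_ms(queue_depth, object_iops):
    """Little's law: average latency of requests queued behind a single
    serialized object sustaining object_iops requests per second."""
    return 1000.0 * queue_depth / object_iops

# 256 direct 4K writes queued against one 4M object; the seq-write fio
# group measured roughly 183 serialized writes/s.
print(avg_latency_ms(256, 183))
```

This lands around 1.4 s, the same ballpark as the clat avg=1384ms fio reported for the seq-write group: the queue, not the disks, dominates the latency.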
Re: Many dns domain names in radosgw
On Sat, Nov 17, 2012 at 1:50 PM, Sławomir Skowron szi...@gmail.com wrote:

Welcome, I have a question. Is there any way to support multiple domain names in one radosgw on a virtual-host type connection in S3?

Are you aiming at having multiple virtual domain names pointing at the same bucket? Currently a gateway can only be set up with a single domain, so the virtual bucket scheme will only translate subdomains of that domain as buckets. Starting at 0.55 there will be a way to point alternative domains to a specific bucket (by modifying their DNS CNAME record); however, it doesn't sound like that's what you're looking for.

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
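The virtual bucket scheme Yehuda describes boils down to a suffix match on the request's Host header against the gateway's single configured domain. A minimal sketch of that resolution logic (hypothetical helper, not radosgw's actual code), showing why buckets under a second domain like y.com can't be translated:

```python
def bucket_from_host(host, gateway_domain):
    """Return the bucket implied by a virtual-host style Host header,
    or None when the request must be treated as path-style / unknown."""
    host = host.lower().rstrip(".")
    gateway_domain = gateway_domain.lower()
    if host == gateway_domain:
        return None  # path-style request: bucket name comes from the URL
    suffix = "." + gateway_domain
    if host.endswith(suffix):
        return host[: -len(suffix)]  # subdomain is the bucket name
    return None  # foreign domain: needs an explicit per-bucket mapping

print(bucket_from_host("b.x.com", "x.com"))  # -> b
print(bucket_from_host("c.y.com", "x.com"))  # -> None (second domain unsupported)
```

This is exactly why the wildcard CNAME (*.x.com) workaround mentioned in the thread works: it folds everything back under the one domain the gateway knows.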
Re: RBD fio Performance concerns
@Sage, thanks for the info :) @Mark: If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). The original benchmark has been performed with 4M block size. And as you can see I still get more IOPS with rand than seq... I just tried with 4M without direct I/O, still the same. I can print fio results if it's needed. We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. I know why I use direct I/O. It's synthetic benchmarks, it's far away from a real life scenario and how common applications works. I just try to see the maximum I/O throughput that I can get from my RBD. All my applications use buffered I/O. @Alexandre: is it the same for you? or do you always get more IOPS with seq? Thanks to all of you.. On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com wrote: Recall: 1. RBD volumes are striped (4M wide) across RADOS objects 2. distinct writes to a single RADOS object are serialized Your sequential 4K writes are direct, depth=256, so there are (at all times) 256 writes queued to the same object. All of your writes are waiting through a very long line, which is adding horrendous latency. If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). 
Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation.

That's correct for some of the benchmarks. However, even with 4K for seq, I still get fewer IOPS. See below my last fio:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
  slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
  clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
  lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
  bw (KB/s) : min=0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=200473/0/0, short=0/0/0
  lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
  read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
  slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
  clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
  lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
  bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
  cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=1632349/0/0, short=0/0/0
  lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
  write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
  slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
  clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
  lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
  bw (KB/s) : min=7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
  cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
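Mark's two recall points explain these numbers: with the default 4 MiB RBD object size, a deep queue of small sequential writes all lands on the same RADOS object and serializes, while random offsets spread across objects. A minimal sketch of the offset-to-object mapping (hypothetical helper, default 4 MiB order assumed):

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size (assumed here)

def object_index(offset):
    """RADOS object a given RBD byte offset falls into."""
    return offset // OBJECT_SIZE

# 256 queued sequential 4K writes starting at offset 0 span only 1 MiB,
# so every one of them targets object 0 and they serialize; 256 writes
# at 4M-aligned offsets each get their own object.
seq_4k = {object_index(i * 4096) for i in range(256)}
seq_4m = {object_index(i * OBJECT_SIZE) for i in range(256)}
print(len(seq_4k), len(seq_4m))  # -> 1 256
```

This is why the advice in the thread is either buffered sequential I/O (so writes aggregate before hitting RADOS) or a 4M block size (one write per object).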
Re: Many dns domain names in radosgw
Yes. I am looking to use the domains x.com and y.com with virtual-host buckets like b.x.com and c.y.com. But if that's not possible, I can handle this with a CNAME for *.x.com and use only b and c on the x.com domain. Thanks for the response.

On 19 Nov 2012 19:02, Yehuda Sadeh yeh...@inktank.com wrote:

On Sat, Nov 17, 2012 at 1:50 PM, Sławomir Skowron szi...@gmail.com wrote: Welcome, I have a question. Is there, any way to support multiple domains names in one radosgw on virtual host type connection in S3 ??

Are you aiming at having multiple virtual domain names pointing at the same bucket? Currently a gateway can only be set up with a single domain, so the virtual bucket scheme will only translate subdomains of that domain as buckets. Starting at 0.55 there will be a way to point alternative domains to a specific bucket (by modifying their dns CNAME record), however, it doesn't sound like it's what you're looking for.

Yehuda
Remote Ceph Install
Hi, I work for Harris Corporation, and we are investigating Ceph as a potential solution to a storage problem that one of our government customers is currently having. I've already created a two-node cluster on a couple of VMs with another VM acting as an administrative client. The cluster was created using some installation instructions supplied to us via Inktank, and through the use of the ceph-deploy script. Aside from a couple of quirky discrepancies between the installation instructions and my environment, everything went well. My issue has cropped up on the second cluster I'm trying to create, which is using a VM and a non-VM server for the nodes in the cluster. Eventually, both nodes in this cluster will be non-VMs, but we're still waiting on the hardware for the second node, so I'm using a VM in the meantime just to get this second cluster up and going. Of course, the administrative client node is still a VM. The problem that I'm having with this second cluster concerns the non-VM server (elsceph01 for the sake of the commands mentioned from here on out). In particular, the issue crops up with the ceph-deploy install elsceph01 command I'm executing on my client VM (cephclient01) to install Ceph on the non-VM server. The installation doesn't appear to be working as the command does not return the OK message that it should when it completes successfully. I've tried using the verbose option on the command to see if that sheds any light on the subject, but alas, it does not: root@cephclient01:~/my-admin-sandbox# ceph-deploy -v install elsceph01 DEBUG:ceph_deploy.install:Installing stable version argonaut on cluster ceph hosts elsceph01 DEBUG:ceph_deploy.install:Detecting platform for host elsceph01 ... DEBUG:ceph_deploy.install:Installing for Ubuntu 12.04 on host elsceph01 ... 
root@cephclient01:~/my-admin-sandbox# Would you happen to have a breakdown of the commands being executed by the ceph-deploy script behind the scenes so I can maybe execute them one-by-one to see where the error is? I have confirmed that it looks like the installation of the software has succeeded as I did a which ceph command on elsceph01, and it reported back /usr/bin/ceph. Also, /etc/ceph/ceph.conf is there, and it matches the file created by the ceph-deploy new ... command on the client. Does the install command do a mkcephfs behind the scenes? The reason I ask is that when I do the ceph-deploy mon command from the client, which is the next command listed in the instructions to do, I get this output: root@cephclient01:~/my-admin-sandbox# ceph-deploy mon creating /var/lib/ceph/tmp/ceph-ELSCEPH01.mon.keyring 2012-11-15 11:35:38.954261 7f7a6c274780 -1 asok(0x260b000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.ELSCEPH01.asok': (2) No such file or directory Traceback (most recent call last): File /usr/local/bin/ceph-deploy, line 9, in module load_entry_point('ceph-deploy==0.0.1', 'console_scripts', 'ceph-deploy')() File /root/ceph-deploy/ceph_deploy/cli.py, line 80, in main added entity mon. 
auth auth(auid = 18446744073709551615 key=AQBWDj5QAP6LHhAAskVBnUkYHJ7eYREmKo5qKA== with 0 caps) return args.func(args) mon/MonMap.h: In function 'void MonMap::add(const string, const entity_addr_t)' thread 7f7a6c274780 time 2012-11-15 11:35:38.955024 mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0) ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe) 1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8] 2: (MonMap::build_initial(CephContext*, std::ostream)+0x113) [0x59bd53] 3: (main()+0x12bb) [0x45ffab] 4: (__libc_start_main()+0xed) [0x7f7a6a6d776d] 5: ceph-mon() [0x462a19] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 2012-11-15 11:35:38.955924 7f7a6c274780 -1 mon/MonMap.h: In function 'void MonMap::add(const string, const entity_addr_t)' thread 7f7a6c274780 time 2012-11-15 11:35:38.955024 mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0) ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe) 1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8] 2: (MonMap::build_initial(CephContext*, std::ostream)+0x113) [0x59bd53] 3: (main()+0x12bb) [0x45ffab] 4: (__libc_start_main()+0xed) [0x7f7a6a6d776d] 5: ceph-mon() [0x462a19] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. -1 2012-11-15 11:35:38.954261 7f7a6c274780 -1 asok(0x260b000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.ELSCEPH01.asok': (2) No such file or directory 0 2012-11-15 11:35:38.955924 7f7a6c274780 -1 mon/MonMap.h: In function 'void MonMap::add(const string, const entity_addr_t)' thread 7f7a6c274780 time 2012-11-15 11:35:38.955024
Re: [PATCH] rbd: get rid of rbd_{get,put}_dev()
Reviewed-by: Dan Mick dan.m...@inktank.com

On 11/16/2012 07:43 AM, Alex Elder wrote:

The functions rbd_get_dev() and rbd_put_dev() are trivial wrappers that add no value, and their existence suggests they may do more than what they do. Get rid of them.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9d9a2f3..f4b5a64 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -337,16 +337,6 @@ void rbd_warn(struct rbd_device *rbd_dev, const char *fmt, ...)
 #  define rbd_assert(expr)	((void) 0)
 #endif /* !RBD_DEBUG */

-static struct device *rbd_get_dev(struct rbd_device *rbd_dev)
-{
-	return get_device(&rbd_dev->dev);
-}
-
-static void rbd_put_dev(struct rbd_device *rbd_dev)
-{
-	put_device(&rbd_dev->dev);
-}
-
 static int rbd_dev_refresh(struct rbd_device *rbd_dev, u64 *hver);
 static int rbd_dev_v2_refresh(struct rbd_device *rbd_dev, u64 *hver);

@@ -357,7 +347,7 @@ static int rbd_open(struct block_device *bdev, fmode_t mode)
 	if ((mode & FMODE_WRITE) && rbd_dev->mapping.read_only)
 		return -EROFS;

-	rbd_get_dev(rbd_dev);
+	(void) get_device(&rbd_dev->dev);
 	set_device_ro(bdev, rbd_dev->mapping.read_only);
 	rbd_dev->open_count++;

@@ -370,7 +360,7 @@ static int rbd_release(struct gendisk *disk, fmode_t mode)
 	rbd_assert(rbd_dev->open_count > 0);
 	rbd_dev->open_count--;

-	rbd_put_dev(rbd_dev);
+	put_device(&rbd_dev->dev);

 	return 0;
 }
[PATCH] rbd block driver fix race between aio completion and aio cancel
From: Stefan Priebe s.pri...@profhost.ag

This one fixes a race which qemu also had in its iscsi block driver, between cancellation and I/O completion: qemu_rbd_aio_cancel was not synchronously waiting for the end of the command. It also removes the useless cancelled flag and instead introduces a status flag, initialized to -EINPROGRESS, like the iscsi block driver.

Signed-off-by: Stefan Priebe s.pri...@profihost.ag
---
 block/rbd.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 5a0f79f..7b3bcbb 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -76,7 +76,7 @@ typedef struct RBDAIOCB {
     int64_t sector_num;
     int error;
     struct BDRVRBDState *s;
-    int cancelled;
+    int status;
 } RBDAIOCB;

 typedef struct RADOSCB {
@@ -376,9 +376,7 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
     RBDAIOCB *acb = rcb->acb;
     int64_t r;

-    if (acb->cancelled) {
-        qemu_vfree(acb->bounce);
-        qemu_aio_release(acb);
+    if (acb->bh) {
         goto done;
     }

@@ -406,9 +404,12 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
             acb->ret = r;
         }
     }
+    acb->status = acb->ret;
+
     /* Note that acb->bh can be NULL in case where the aio was cancelled */
     acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
     qemu_bh_schedule(acb->bh);
+
 done:
     g_free(rcb);
 }
@@ -573,7 +574,10 @@ static void qemu_rbd_close(BlockDriverState *bs)
 static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     RBDAIOCB *acb = (RBDAIOCB *) blockacb;
-    acb->cancelled = 1;
+
+    while (acb->status == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
 }

 static AIOPool rbd_aio_pool = {
@@ -642,10 +646,11 @@ static void rbd_aio_bh_cb(void *opaque)
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
-    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
+    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
+
     qemu_aio_release(acb);
 }

@@ -689,8 +694,8 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
     acb->ret = 0;
     acb->error = 0;
     acb->s = s;
-    acb->cancelled = 0;
     acb->bh = NULL;
+    acb->status = -EINPROGRESS;

     if (cmd == RBD_AIO_WRITE) {
         qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
--
1.7.10.4
[no subject]
From Stefan Priebe s.pri...@profihost.ag
# This line is ignored.
From: Stefan Priebe s.pri...@profihost.ag
Cc: pve-de...@pve.proxmox.com
Cc: pbonz...@redhat.com
Cc: ceph-devel@vger.kernel.org
Subject: QEMU/PATCH: rbd block driver: fix race between completion and cancel
In-Reply-To: ve-de...@pve.proxmox.com pbonz...@redhat.com ceph-devel@vger.kernel.org
Re: RBD fio Performance concerns
Which iodepth did you use for those benchmarks?

I really don't understand why I can't get more rand read IOPS with 4K blocks ...

Me neither; I hope to get some clarification from the Inktank guys. It doesn't make any sense to me...
--
Best regards,
Sébastien HAN.

On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER aderum...@odiso.com wrote:

@Alexandre: is it the same for you? or do you always get more IOPS with seq?

rand read 4K : 6000 iops
seq read 4K  : 3500 iops
seq read 4M  : 31 iops (1 gigabit client bandwidth limit)
rand write 4k: 6000 iops (tmpfs journal)
seq write 4k : 1600 iops
seq write 4M : 31 iops (1 gigabit client bandwidth limit)

I really don't understand why I can't get more rand read iops with 4K blocks ... I tried with a high-end CPU for the client; it doesn't change anything. But the test cluster uses old 8-core E5420s @ 2.50GHz (CPU usage is around 15% on the cluster during the read bench).

----- Original Message -----
From: Sébastien Han han.sebast...@gmail.com
To: Mark Kampe mark.ka...@inktank.com
Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, 19 November 2012 19:03:40
Subject: Re: RBD fio Performance concerns

@Sage, thanks for the info :) @Mark: If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). The original benchmark has been performed with 4M block size. And as you can see I still get more IOPS with rand than seq... I just tried with 4M without direct I/O, still the same. I can print fio results if it's needed. We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. I know why I use direct I/O.
It's synthetic benchmarks, it's far away from a real life scenario and how common applications works. I just try to see the maximum I/O throughput that I can get from my RBD. All my applications use buffered I/O. @Alexandre: is it the same for you? or do you always get more IOPS with seq? Thanks to all of you.. On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com wrote: Recall: 1. RBD volumes are striped (4M wide) across RADOS objects 2. distinct writes to a single RADOS object are serialized Your sequential 4K writes are direct, depth=256, so there are (at all times) 256 writes queued to the same object. All of your writes are waiting through a very long line, which is adding horrendous latency. If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. That's correct for some of the benchmarks. However even with 4K for seq, I still get less IOPS. 
See below my last fio:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
  slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
  clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
  lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
  bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=200473/0/0, short=0/0/0
  lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
  read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
  slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
  clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
  lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
  bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
  cpu : usr=8.26%, sys=49.11%,
Re: RBD fio Performance concerns
Hello Mark,

See below my benchmark results:

- RADOS bench with 4M block size, write:

# rados -p bench bench 300 write -t 32 --no-cleanup
Maintaining 32 concurrent writes of 4194304 bytes for at least 300 seconds.
2012-11-19 21:35:01.722143 min lat: 0.255396 max lat: 8.40212 avg lat: 1.14076
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat  avg lat
  300      32      8414      8382   111.737       104  0.502774  1.14076
Total time run:          300.814954
Total writes made:       8414
Write size:              4194304
Bandwidth (MB/sec):      111.883
Stddev Bandwidth:        7.4274
Max bandwidth (MB/sec):  132
Min bandwidth (MB/sec):  56
Average Latency:         1.14352
Stddev Latency:          1.18344
Max latency:             8.40212
Min latency:             0.255396

- RADOS bench with 4M block size, seq:

# rados -p bench bench 300 seq -t 32 --no-cleanup
2012-11-19 21:40:35.128728 min lat: 0.224415 max lat: 6.14781 avg lat: 1.1591
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat  avg lat
  300      31      8284      8253   110.021       108   1.87698   1.1591
Total time run:        300.931287
Total reads made:      8285
Read size:             4194304
Bandwidth (MB/sec):    110.125
Average Latency:       1.16177
Max latency:           6.14781
Min latency:           0.224415

- RBD fio test; as you recommended, I used a 4M block size for the seq tests in this first run.
See below the fio configuration file used:

[global]
ioengine=libaio
iodepth=4
size=1G
runtime=60
filename=/dev/rbd1

[seq-read]
rw=read
bs=4M
stonewall
direct=1

[rand-read]
rw=randread
bs=4K
stonewall
direct=1

[seq-write]
rw=write
bs=4M
stonewall
direct=1

[rand-write]
rw=randwrite
bs=4K
stonewall
direct=1

Results with iodepth 4 and a 1G file:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=4
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
seq-write: (g=2): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=4
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [64.2% done] [0K/2588K /s] [0 /632 iops] [eta 01m:18s]
seq-read: (groupid=0, jobs=1): err= 0: pid=10586
  read : io=1024.0MB, bw=110656KB/s, iops=27 , runt=  9476msec
  slat (usec): min=250 , max=1812 , avg=389.88, stdev=178.26
  clat (msec): min=37 , max=615 , avg=147.42, stdev=102.77
  lat (msec): min=38 , max=615 , avg=147.81, stdev=102.77
  bw (KB/s) : min=84216, max=122390, per=99.60%, avg=110208.50, stdev=9149.98
  cpu : usr=0.00%, sys=0.97%, ctx=1552, majf=0, minf=4119
  IO depths: 1=0.4%, 2=0.8%, 4=98.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued r/w/d: total=256/0/0, short=0/0/0
  lat (msec): 50=4.69%, 100=31.64%, 250=50.78%, 500=11.72%, 750=1.17%
rand-read: (groupid=1, jobs=1): err= 0: pid=10868
  read : io=161972KB, bw=2697.1KB/s, iops=674 , runt= 60036msec
  slat (usec): min=12 , max=346 , avg=39.89, stdev=10.04
  clat (usec): min=570 , max=50215 , avg=5885.64, stdev=12119.46
  lat (usec): min=601 , max=50258 , avg=5926.07, stdev=12117.44
  bw (KB/s) : min= 2015, max= 3356, per=100.15%, avg=2701.03, stdev=276.41
  cpu : usr=0.51%, sys=2.14%, ctx=66054, majf=0, minf=26
  IO depths: 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued r/w/d: total=40493/0/0, short=0/0/0
  lat (usec): 750=3.69%, 1000=60.21%
  lat (msec): 2=19.37%, 4=1.49%, 10=1.30%, 20=0.30%, 50=13.64%
  lat (msec): 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=12619
  write: io=1024.0MB, bw=112412KB/s, iops=27 , runt=  9328msec
  slat (usec): min=510 , max=1683 , avg=820.63, stdev=150.32
  clat (msec): min=47 , max=744 , avg=144.21, stdev=73.99
  lat (msec): min=48 , max=744 , avg=145.03, stdev=74.00
  bw (KB/s) : min=103193, max=124830, per=100.87%, avg=113390.71, stdev=6178.93
  cpu : usr=1.46%, sys=0.81%, ctx=267, majf=0, minf=21
  IO depths: 1=0.4%, 2=0.8%, 4=98.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued r/w/d: total=0/256/0, short=0/0/0
  lat (msec): 50=0.78%, 100=17.97%, 250=75.39%, 500=5.08%, 750=0.78%
rand-write: (groupid=3, jobs=1): err= 0: pid=12934
  write: io=125352KB, bw=2088.1KB/s, iops=522 , runt= 60007msec
  slat (usec): min=13 , max=388 , avg=50.47, stdev=13.73
  clat (msec): min=1 , max=1271 , avg= 7.60, stdev=22.16
  lat (msec): min=1 , max=1271 , avg= 7.66, stdev=22.16
  bw (KB/s) : min= 155, max= 2944, per=102.13%, avg=2132.45,
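The 4M numbers above are self-consistent: at a fixed block size, IOPS is just bandwidth divided by block size, which makes for a quick sanity check when comparing rados bench and fio figures. A small sketch (hypothetical helper):

```python
def iops_from_bw(bw_kb_per_s, block_size_bytes):
    """IOPS implied by a bandwidth figure at a fixed block size."""
    return bw_kb_per_s * 1024 / block_size_bytes

# fio's seq-read group above reports bw=110656KB/s at 4M blocks,
# which implies ~27 IOPS -- matching the reported iops=27.
print(round(iops_from_bw(110656, 4 * 1024 * 1024)))  # -> 27
```

So at 4M the run is bandwidth-bound (~110 MB/s either way), and the interesting anomaly is only in the 4K results.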
Can't start ceph mon
I have a problem in which I can't start my ceph monitor. The log is shown below. The log shows version 0.54. I was running 0.52 when the problem arose, and I moved to the latest in case the newer version fixed the problem. The original failure happened a week or so ago, and could have been as a result of running out of disk space when the ceph monitor log became huge. What should I do to recover the situation? David 2012-11-19 20:38:51.598468 7fc13fdc6780 0 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 21012 2012-11-19 20:38:51.598482 7fc13fdc6780 1 store(/ceph/mon.vault01) mount 2012-11-19 20:38:51.598527 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 21 2012-11-19 20:38:51.598542 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl magic = 21 bytes 2012-11-19 20:38:51.598562 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.598567 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.598582 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 205 2012-11-19 20:38:51.598586 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl monmap/latest = 205 bytes 2012-11-19 20:38:51.598809 7fc13fdc6780 1 -- 10.0.1.1:6789/0 learned my addr 10.0.1.1:6789/0 2012-11-19 20:38:51.598818 7fc13fdc6780 1 accepter.accepter.bind my_inst.addr is 10.0.1.1:6789/0 need_addr=0 2012-11-19 20:38:51.599498 7fc13fdc6780 1 -- 10.0.1.1:6789/0 messenger.start 2012-11-19 20:38:51.599508 7fc13fdc6780 1 accepter.accepter.start 2012-11-19 20:38:51.599610 7fc13fdc6780 1 mon.vault01@-1(probing) e1 init fsid 4d7d8d20-338c-4bdc-9918-9bcf04f9a832 2012-11-19 20:38:51.599674 7fc13cdbe700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c6c0 sd=14 :6789 pgs=0 cs=0 l=0).accept sd=14 2012-11-19 20:38:51.599678 7fc141eff700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c240 sd=9 :6789 pgs=0 cs=0 l=0).accept sd=9 2012-11-19 20:38:51.599718 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 37 2012-11-19 
20:38:51.599723 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl cluster_uuid = 37 bytes 2012-11-19 20:38:51.599718 7fc13ccbd700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c480 sd=19 :6789 pgs=0 cs=0 l=0).accept sd=19 2012-11-19 20:38:51.599729 7fc13fdc6780 10 mon.vault01@-1(probing) e1 check_fsid cluster_uuid contains '4d7d8d20-338c-4bdc-9918-9bcf04f9a832' 2012-11-19 20:38:51.599739 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.599745 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.599751 7fc13fdc6780 10 mon.vault01@-1(probing) e1 features compat={},rocompat={},incompat={1=initial feature set (~v.18)} 2012-11-19 20:38:51.599759 7fc13fdc6780 15 store(/ceph/mon.vault01) exists_bl joined 2012-11-19 20:38:51.599769 7fc13fdc6780 10 mon.vault01@-1(probing) e1 has_ever_joined = 1 2012-11-19 20:38:51.599794 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/last_committed = 13 2012-11-19 20:38:51.599801 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/first_committed = 132833 2012-11-19 20:38:51.599810 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 239840 2012-11-19 20:38:51.599928 7fc13cbbc700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213cd80 sd=20 :6789 pgs=0 cs=0 l=0).accept sd=20 2012-11-19 20:38:51.599950 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl pgmap/latest = 239840 bytes --- begin dump of recent events ---2012-11-19 20:38:51.600509 7fc13fdc6780 -1 *** Caught signal (Aborted) ** in thread 7fc13fdc6780 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150) 1: ceph-mon() [0x53adf8] 2: (()+0xfe90) [0x7fc141830e90] 3: (gsignal()+0x3e) [0x7fc140016dae] 4: (abort()+0x17b) [0x7fc14001825b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d] 6: (()+0xb31b6) [0x7fc141af11b6] 7: (()+0xb31e3) [0x7fc141af11e3] 8: (()+0xb32de) [0x7fc141af12de] 9: ceph-mon() [0x5ecb9f] 10: (Paxos::get_stashed(ceph::buffer::list)+0x1ed) [0x49e28d] 11: (Paxos::init()+0x109) 
[0x49e609] 12: (Monitor::init()+0x36a) [0x485a4a] 13: (main()+0x1289) [0x46d909] 14: (__libc_start_main()+0xed) [0x7fc14000364d] 15: ceph-mon() [0x46fa09] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. -55 2012-11-19 20:38:51.596694 7fc13fdc6780 5 asok(0x213d000) register_command perfcounters_dump hook 0x2131050 -55 2012-11-19 20:38:51.596720 7fc13fdc6780 5 asok(0x213d000) register_command 1 hook 0x2131050 -54 2012-11-19 20:38:51.596725 7fc13fdc6780 5 asok(0x213d000) register_command perf dump hook 0x2131050 -53 2012-11-19 20:38:51.596735 7fc13fdc6780 5 asok(0x213d000) register_command perfcounters_schema hook 0x2131050 -52 2012-11-19 20:38:51.596740 7fc13fdc6780 5 asok(0x213d000) register_command 2 hook 0x2131050 -51 2012-11-19 20:38:51.596745 7fc13fdc6780 5 asok(0x213d000) register_command perf schema hook 0x2131050 -50 2012-11-19
Cannot Start Ceph Mon
(Apologies if this is seen to be a repeat posting: I think that the last attempt fell into the void). I can't start my ceph monitor. The log is below. Though this shows version 0.54, the problem arose whilst using 0.52. Something may have become corrupted when the disk space ran out due to an immense ceph mon log. Is there anything I can do to recover the situation? Regards, David bash-4.1# cat /var/log/ceph/mon.vault01.log 2012-11-19 20:38:51.598468 7fc13fdc6780 0 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 21012 2012-11-19 20:38:51.598482 7fc13fdc6780 1 store(/ceph/mon.vault01) mount 2012-11-19 20:38:51.598527 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 21 2012-11-19 20:38:51.598542 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl magic = 21 bytes 2012-11-19 20:38:51.598562 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.598567 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.598582 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 205 2012-11-19 20:38:51.598586 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl monmap/latest = 205 bytes 2012-11-19 20:38:51.598809 7fc13fdc6780 1 -- 10.0.1.1:6789/0 learned my addr 10.0.1.1:6789/0 2012-11-19 20:38:51.598818 7fc13fdc6780 1 accepter.accepter.bind my_inst.addr is 10.0.1.1:6789/0 need_addr=0 2012-11-19 20:38:51.599498 7fc13fdc6780 1 -- 10.0.1.1:6789/0 messenger.start 2012-11-19 20:38:51.599508 7fc13fdc6780 1 accepter.accepter.start 2012-11-19 20:38:51.599610 7fc13fdc6780 1 mon.vault01@-1(probing) e1 init fsid 4d7d8d20-338c-4bdc-9918-9bcf04f9a832 2012-11-19 20:38:51.599674 7fc13cdbe700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c6c0 sd=14 :6789 pgs=0 cs=0 l=0).accept sd=14 2012-11-19 20:38:51.599678 7fc141eff700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c240 sd=9 :6789 pgs=0 cs=0 l=0).accept sd=9 2012-11-19 20:38:51.599718 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 37 
2012-11-19 20:38:51.599723 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl cluster_uuid = 37 bytes 2012-11-19 20:38:51.599718 7fc13ccbd700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c480 sd=19 :6789 pgs=0 cs=0 l=0).accept sd=19 2012-11-19 20:38:51.599729 7fc13fdc6780 10 mon.vault01@-1(probing) e1 check_fsid cluster_uuid contains '4d7d8d20-338c-4bdc-9918-9bcf04f9a832' 2012-11-19 20:38:51.599739 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.599745 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.599751 7fc13fdc6780 10 mon.vault01@-1(probing) e1 features compat={},rocompat={},incompat={1=initial feature set (~v.18)} 2012-11-19 20:38:51.599759 7fc13fdc6780 15 store(/ceph/mon.vault01) exists_bl joined 2012-11-19 20:38:51.599769 7fc13fdc6780 10 mon.vault01@-1(probing) e1 has_ever_joined = 1 2012-11-19 20:38:51.599794 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/last_committed = 13 2012-11-19 20:38:51.599801 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/first_committed = 132833 2012-11-19 20:38:51.599810 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 239840 2012-11-19 20:38:51.599928 7fc13cbbc700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213cd80 sd=20 :6789 pgs=0 cs=0 l=0).accept sd=20 2012-11-19 20:38:51.599950 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl pgmap/latest = 239840 bytes --- begin dump of recent events ---2012-11-19 20:38:51.600509 7fc13fdc6780 -1 *** Caught signal (Aborted) ** in thread 7fc13fdc6780 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150) 1: ceph-mon() [0x53adf8] 2: (()+0xfe90) [0x7fc141830e90] 3: (gsignal()+0x3e) [0x7fc140016dae] 4: (abort()+0x17b) [0x7fc14001825b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d] 6: (()+0xb31b6) [0x7fc141af11b6] 7: (()+0xb31e3) [0x7fc141af11e3] 8: (()+0xb32de) [0x7fc141af12de] 9: ceph-mon() [0x5ecb9f] 10: (Paxos::get_stashed(ceph::buffer::list)+0x1ed) [0x49e28d] 11: 
(Paxos::init()+0x109) [0x49e609] 12: (Monitor::init()+0x36a) [0x485a4a] 13: (main()+0x1289) [0x46d909] 14: (__libc_start_main()+0xed) [0x7fc14000364d] 15: ceph-mon() [0x46fa09] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. -55 2012-11-19 20:38:51.596694 7fc13fdc6780 5 asok(0x213d000) register_command perfcounters_dump hook 0x2131050 -55 2012-11-19 20:38:51.596720 7fc13fdc6780 5 asok(0x213d000) register_command 1 hook 0x2131050 -54 2012-11-19 20:38:51.596725 7fc13fdc6780 5 asok(0x213d000) register_command perf dump hook 0x2131050 -53 2012-11-19 20:38:51.596735 7fc13fdc6780 5 asok(0x213d000) register_command perfcounters_schema hook 0x2131050 -52 2012-11-19 20:38:51.596740 7fc13fdc6780 5 asok(0x213d000) register_command 2 hook 0x2131050 -51 2012-11-19 20:38:51.596745 7fc13fdc6780 5 asok(0x213d000) register_command perf schema hook 0x2131050 -50
librbd discard bug problems - i got it
Hello Josh, after digging around for three days I've got it. The problem is in aio_discard in internal.cc. The I/O fails when AioZero or AioTruncate is used; it works fine with AioRemove. It seems to depend on the overlap handling. Hopefully I'm able to provide a patch tonight. Greets, Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
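For readers following the hunt: librbd turns a discard into per-object operations, and which operation is chosen depends on how the discarded range overlaps each object. A rough model of that selection (illustrative Python, not the internal.cc code; the 4 MB default object size is assumed):

```python
# Illustrative sketch (not librbd code) of how a discard over an image
# range might pick a per-object operation, mirroring the AioRemove /
# AioTruncate / AioZero cases Stefan mentions. The 4 MiB RBD default
# object size is an assumption here.
OBJECT_SIZE = 4 << 20

def discard_op(obj_off, length):
    """obj_off/length describe the discarded extent within one object."""
    if obj_off == 0 and length >= OBJECT_SIZE:
        return "remove"      # whole object discarded: delete it
    if obj_off + length >= OBJECT_SIZE:
        return "truncate"    # discard runs to the end: truncate the tail
    return "zero"            # discard in the middle: write zeroes

print(discard_op(0, OBJECT_SIZE))      # remove
print(discard_op(1 << 20, 3 << 20))    # truncate
print(discard_op(1 << 20, 1 << 20))    # zero
```

The overlap Stefan mentions (e.g. with a parent image or snapshot) changes which of these cases is safe to take, which is where the bug appears to hide.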
Re: Can't start ceph mon
On Mon, Nov 19, 2012 at 1:08 PM, Dave Humphreys (Datatone) d...@datatone.co.uk wrote: I have a problem in which I can't start my ceph monitor. The log is shown below. The log shows version 0.54. I was running 0.52 when the problem arose, and I moved to the latest in case the newer version fixed the problem. The original failure happened a week or so ago, and could have been as a result of running out of disk space when the ceph monitor log became huge. That is almost certainly the case, although I thought we were handling this issue better now. What should I do to recover the situation? Do you have other monitors in working order? The easiest way to handle it if that's the case is just to remove this monitor from the cluster and add it back in as a new monitor with a fresh store. If not we can look into reconstructing it. -Greg David 2012-11-19 20:38:51.598468 7fc13fdc6780 0 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 21012 2012-11-19 20:38:51.598482 7fc13fdc6780 1 store(/ceph/mon.vault01) mount 2012-11-19 20:38:51.598527 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 21 2012-11-19 20:38:51.598542 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl magic = 21 bytes 2012-11-19 20:38:51.598562 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.598567 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.598582 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 205 2012-11-19 20:38:51.598586 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl monmap/latest = 205 bytes 2012-11-19 20:38:51.598809 7fc13fdc6780 1 -- 10.0.1.1:6789/0 learned my addr 10.0.1.1:6789/0 2012-11-19 20:38:51.598818 7fc13fdc6780 1 accepter.accepter.bind my_inst.addr is 10.0.1.1:6789/0 need_addr=0 2012-11-19 20:38:51.599498 7fc13fdc6780 1 -- 10.0.1.1:6789/0 messenger.start 2012-11-19 20:38:51.599508 7fc13fdc6780 1 accepter.accepter.start 2012-11-19 20:38:51.599610 
7fc13fdc6780 1 mon.vault01@-1(probing) e1 init fsid 4d7d8d20-338c-4bdc-9918-9bcf04f9a832 2012-11-19 20:38:51.599674 7fc13cdbe700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c6c0 sd=14 :6789 pgs=0 cs=0 l=0).accept sd=14 2012-11-19 20:38:51.599678 7fc141eff700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c240 sd=9 :6789 pgs=0 cs=0 l=0).accept sd=9 2012-11-19 20:38:51.599718 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 37 2012-11-19 20:38:51.599723 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl cluster_uuid = 37 bytes 2012-11-19 20:38:51.599718 7fc13ccbd700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c480 sd=19 :6789 pgs=0 cs=0 l=0).accept sd=19 2012-11-19 20:38:51.599729 7fc13fdc6780 10 mon.vault01@-1(probing) e1 check_fsid cluster_uuid contains '4d7d8d20-338c-4bdc-9918-9bcf04f9a832' 2012-11-19 20:38:51.599739 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.599745 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.599751 7fc13fdc6780 10 mon.vault01@-1(probing) e1 features compat={},rocompat={},incompat={1=initial feature set (~v.18)} 2012-11-19 20:38:51.599759 7fc13fdc6780 15 store(/ceph/mon.vault01) exists_bl joined 2012-11-19 20:38:51.599769 7fc13fdc6780 10 mon.vault01@-1(probing) e1 has_ever_joined = 1 2012-11-19 20:38:51.599794 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/last_committed = 13 2012-11-19 20:38:51.599801 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/first_committed = 132833 2012-11-19 20:38:51.599810 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 239840 2012-11-19 20:38:51.599928 7fc13cbbc700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213cd80 sd=20 :6789 pgs=0 cs=0 l=0).accept sd=20 2012-11-19 20:38:51.599950 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl pgmap/latest = 239840 bytes --- begin dump of recent events ---2012-11-19 20:38:51.600509 7fc13fdc6780 -1 *** Caught signal (Aborted) ** in thread 7fc13fdc6780 ceph version 0.54 
(commit:60b84b095b1009a305d4d6a5b16f88571cbd3150) 1: ceph-mon() [0x53adf8] 2: (()+0xfe90) [0x7fc141830e90] 3: (gsignal()+0x3e) [0x7fc140016dae] 4: (abort()+0x17b) [0x7fc14001825b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d] 6: (()+0xb31b6) [0x7fc141af11b6] 7: (()+0xb31e3) [0x7fc141af11e3] 8: (()+0xb32de) [0x7fc141af12de] 9: ceph-mon() [0x5ecb9f] 10: (Paxos::get_stashed(ceph::buffer::list)+0x1ed) [0x49e28d] 11: (Paxos::init()+0x109) [0x49e609] 12: (Monitor::init()+0x36a) [0x485a4a] 13: (main()+0x1289) [0x46d909] 14: (__libc_start_main()+0xed) [0x7fc14000364d] 15: ceph-mon() [0x46fa09] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. -55 2012-11-19 20:38:51.596694 7fc13fdc6780 5 asok(0x213d000) register_command perfcounters_dump hook 0x2131050 -55 2012-11-19 20:38:51.596720
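If other monitors are in working order, the remove-and-re-add recovery Greg suggests looks roughly like the following. This is a command sketch only: the monitor name, store path, keyring, and exact mkfs flags are assumptions that vary by release and setup, so check the docs for your version.

```shell
# Sketch only -- adapt names/paths; syntax varies across Ceph releases.
ceph mon remove vault01                         # drop the broken mon from the monmap
mv /ceph/mon.vault01 /ceph/mon.vault01.broken   # keep the old store, just in case
ceph mon getmap -o /tmp/monmap                  # fetch the current monitor map
ceph-mon -i vault01 --mkfs --monmap /tmp/monmap --keyring <mon-keyring>
ceph mon add vault01 <ip:port>                  # re-add it, then start ceph-mon
```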
Re: Files lost after mds rebuild
On Mon, Nov 19, 2012 at 7:55 AM, Drunkard Zhang gongfan...@gmail.com wrote: I created a ceph cluster for testing; here's the mistake I made: I added a second mds, mds.ab, executed 'ceph mds set_max_mds 2', then removed the mds just added. Then after 'ceph mds set_max_mds 1' the first mds, mds.aa, crashed and became laggy. As I couldn't repair mds.aa, I ran 'ceph mds newfs metadata data --yes-i-really-mean-it'. So this command is a mkfs sort of thing. It deleted all the allocation tables and filesystem metadata in favor of new, empty ones. You should not run --yes-i-really-mean-it commands if you don't know exactly what the command is doing and why you're using it. mds.aa was back, but 1TB of data in the cluster was lost, though the disk space is still shown as used by 'ceph -s'. Is there any chance I can get my data back? If not, how can I reclaim the disk space? There's not currently a great way to get that data back. With sufficient energy it could be reconstructed by looking through all the RADOS objects and putting something together. To retrieve the disk space, you'll want to delete the data and metadata RADOS pools. This will of course *eliminate* the data you have in your new filesystem, so grab that out first if there's anything there you care about. Then create the pools and run the newfs command again. Also, you've got the syntax wrong on that newfs command. You should be using pool IDs: ceph mds newfs 1 0 --yes-i-really-mean-it (Though these IDs may change after re-creating the pools.)
-Greg Now it looks like: log3 ~ # ceph -s health HEALTH_OK monmap e1: 1 mons at {log3=10.205.119.2:6789/0}, election epoch 0, quorum 0 log3 osdmap e1555: 28 osds: 20 up, 20 in pgmap v56518: 960 pgs: 960 active+clean; 1134 GB data, 2306 GB used, 51353 GB / 55890 GB avail mdsmap e703: 1/1/1 up {0=aa=up:active}, 1 up:standby log3 ~ # df | grep osd |sort /dev/sdb1 2.8T 124G 2.5T 5% /ceph/osd.0 /dev/sdc1 2.8T 104G 2.6T 4% /ceph/osd.1 /dev/sdd1 2.8T 84G 2.6T 4% /ceph/osd.2 /dev/sde1 2.8T 117G 2.6T 5% /ceph/osd.3 /dev/sdf1 2.8T 105G 2.6T 4% /ceph/osd.4 /dev/sdg1 2.8T 84G 2.6T 4% /ceph/osd.5 /dev/sdh1 2.8T 140G 2.5T 6% /ceph/osd.6 /dev/sdi1 2.8T 134G 2.5T 5% /ceph/osd.8 /dev/sdj1 2.8T 112G 2.6T 5% /ceph/osd.7 /dev/sdk1 2.8T 159G 2.5T 6% /ceph/osd.9 /dev/sdl1 2.8T 126G 2.5T 5% /ceph/osd.10 osd on another host didn't show.
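Spelled out, the cleanup Greg describes is roughly the following. This is a sketch: pg counts are illustrative, the pool IDs must be read back from 'ceph osd dump', and the command is destructive.

```shell
# DESTRUCTIVE sketch of the recovery Greg outlines -- copy anything you
# still need out of the filesystem first. pg_num values are illustrative.
ceph osd pool delete data
ceph osd pool delete metadata
ceph osd pool create data 960
ceph osd pool create metadata 960
ceph osd dump | grep '^pool'     # note the new pool IDs
ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it
```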
Re: Is the disk on MDS used for journal?
On Sun, Nov 18, 2012 at 7:14 PM, liu yaqi liuyaqiy...@gmail.com wrote: Is the disk on the MDS used for the journal? Does it have some other use? The MDS doesn't make any use of local disk space — it stores everything in RADOS. You need enough local disk to provide a configuration file, keyring, and debug logging (if you want those things). -Greg
Re: OSD network failure
On Fri, Nov 16, 2012 at 5:56 PM, Josh Durgin josh.dur...@inktank.com wrote: On 11/15/2012 01:51 AM, Gandalf Corvotempesta wrote: 2012/11/15 Josh Durgin josh.dur...@inktank.com: So basically you'd only need a single nic per storage node. Multiple can be useful to separate frontend and backend traffic, but ceph is designed to maintain strong consistency when failures occur. Probably I've not explained well. I'll have multiple nics, one for the frontend, one for the backend used as the OSD sync network. What happens in case of backend network failure? The frontend network is still ok, the OSD is still reachable but is not able to sync data. Ah, ok. By default, the OSDs use the backend network for heartbeats, so if it fails, they will notice and report peers they can't reach as failed to the monitors, and the normal failure handling takes care of things. If you're worried about consistency, remember that a write won't complete until it's on disk on all replicas. If you're interested in the gory details of maintaining consistency, check out the peering process [1]. Josh [1] http://ceph.com/docs/master/dev/peering/ Actually, right now a failed cluster network and an up public network is something the OSDs do not handle well — they will mark each other down on the monitor and then tell the monitor hey, I'm not dead! and start flapping pretty horrendously. We first ran across it a couple weeks ago and have started to think about it, but I'm not sure a fix for this is going to make it into the initial Bobtail release. :( -Greg
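For reference, the frontend/backend split being discussed maps to two ceph.conf options (fragment is illustrative; the subnets are made up):

```ini
[global]
    ; client-facing ("frontend"/public) traffic
    public network  = 192.168.1.0/24
    ; OSD replication and heartbeat ("backend") traffic
    cluster network = 192.168.2.0/24
```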
Unused doc/images/.jpg files
Hi - There are several jpg files in the doc/images directory of the tarball that don't seem to be used in the html files or man pages after docs are built. If they are used somewhere, where is that? What am I missing? Some of the .png files are used. root@84Server:~/ceph-ceph-fd4b839# ls doc/images/ AccessMethods.jpg RADOS.jpg chef.png lightstack.png radosStack.svg techstack.png CEPHConfig.jpg RBD.jpg chef.svg lightstack.svg stack.png techstack.svg CRUSH.jpg RDBSnapshots.jpg docreviewprocess.jpg osdStack.svg stack.svg Server:~/ceph-ceph-fd4b839# grep -R osdStack.svg * Server:~/ceph-ceph-fd4b839# grep -R techstack.png * doc/images/techstack.svg: inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png doc/images/radosStack.svg: inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png Server:~/ceph-ceph-fd4b839# grep -R stack.png * Binary file build-doc/doctrees/index.doctree matches Binary file build-doc/doctrees/environment.pickle matches build-doc/output/html/index.html:img alt=_images/stack.png src=_images/stack.png / build-doc/output/html/_sources/index.txt:.. image:: images/stack.png doc/index.rst:.. image:: images/stack.png doc/images/techstack.svg: inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png doc/images/radosStack.svg: inkscape:export-filename=/home/johnw/ceph/doc/images/techstack.png doc/images/stack.svg: inkscape:export-filename=/home/johnw/ceph/doc/images/stack.png doc/images/lightstack.svg: inkscape:export-filename=/home/johnw/ceph/doc/images/lightstack.png /tmp/ceph-ceph-fd4b839 Server:~/ceph-ceph-fd4b839# find . 
-name *.jpg -print ./doc/images/RADOS.jpg ./doc/images/CRUSH.jpg ./doc/images/AccessMethods.jpg ./doc/images/docreviewprocess.jpg ./doc/images/CEPHConfig.jpg ./doc/images/RDBSnapshots.jpg ./doc/images/RBD.jpg Server:~/ceph-ceph-fd4b839# grep -R AccessMethods * Server:~/ceph-ceph-fd4b839# grep -R CEPHConfig.jpg * Server:~/ceph-ceph-fd4b839# grep -R RBD.jpg * Server:~/ceph-ceph-fd4b839# grep -R RADOS.jpg * Thanks, Tim
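The manual greps above can be generalized into a small script. A sketch (the doc/images layout mirrors the listing above; the helper name is made up):

```shell
# find_unused_images: print image files under $1/doc/images that nothing
# else under $1/doc references by filename. Paths mirror the tree quoted
# above; adjust to your checkout.
find_unused_images() {
    root=$1
    for img in "$root"/doc/images/*; do
        [ -f "$img" ] || continue
        name=$(basename "$img")
        grep -rq --exclude-dir=images -- "$name" "$root/doc" || echo "$name"
    done
}
```

Note this only catches references by literal filename, so images pulled in indirectly (e.g. via an Inkscape export comment, as with the .svg files above) still need a human look.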
Re: deprecating mkcephfs (the arrival of light-weight deployment tools)
On Mon, 19 Nov 2012, Isaac Otsiabah wrote: I am trying to understand ceph deployment direction because from this link http://ceph.com/docs/master/rados/deployment/ it is mentioned that mkcephfs is deprecated. It also has the statement below which mentions light-weight deployment scripts to help you evaluate Ceph. We provide light-weight deployment scripts to help you evaluate Ceph. For professional deployment, you should consider professional deployment systems such as Juju, Puppet, Chef or Crowbar. I think there is a need to have native ceph deployment tools that aren't dependent upon any third party deployment tools. So my question is this: 1. when will the light-weight deployment scripts be available and which ceph version will they be released into? http://github.com/ceph/ceph-deploy is available for initial testing, but far from ready for widespread use. mkcephfs is still the preferred installation path. I'll make sure the 'deprecated' notation is removed until a real alternative is ready. 2. Now, going forward, when will mkcephfs not work anymore (what ceph version)? It will be maintained at least through cuttlefish (the next stable release), though probably longer, so that there is plenty of overlap with whatever tool will follow. sage
Re: some snapshot problems
On Sun, Nov 11, 2012 at 11:02 PM, liu yaqi liuyaqiy...@gmail.com wrote: 2012/11/9 Sage Weil s...@inktank.com Lots of different snapshots: - librados lets you do 'selfmanaged snaps' in its API, which let an application control which snapshots apply to which objects. - you can create a 'pool' snapshot on an entire librados pool. this cannot be used at the same time as rbd, fs, or the above 'selfmanaged' snaps. - rbd lets you snapshot block device images (by using the librados selfmanaged snap API). - the ceph file system lets you snapshot any subdirectory (again utilizing the underlying RADOS functionality). I am confused about the concept of pool and image. Is one pool a set of placement groups? When I snap an image, does it mean a snapshot of one disk? A pool is a logical namespace into which you place objects. Placement groups are shards of a pool. Snapping an image makes use of the self-managed snapshot infrastructure, and takes a snapshot of one RBD volume (so yes, if that's what you meant by a snapshot of one disk). I think a snapshot is used to preserve the state of a directory at one time, and I wonder if there could be a situation where I preserve the data of the directory but do not preserve its metadata, maybe because the metadata and data are not in the same pool; could this happen? The Ceph filesystem builds a bit more on top of the RADOS snapshots — metadata and data are almost never in the same pool, and the metadata snapshots don't use RADOS snapshots anyway. When I snap a directory and trace the code in the mds, there is snapinfo added to the inode, but where and when is the content of the snap created? What is the data structure of the snap content? When the client sets an inode attribute, if snapid==NOSNAP it returns; does this mean that if the inode has been snapped it cannot be changed? So snap is not using the copy-on-write method (create the snap, then change the content of the snapped file when setting an inode attribute or writing the file)?
If not copy-on-write, what's the snap workflow for a directory? You want to look into the code surrounding SnapRealms to see how the metadata for snapshots is managed. There are multiple metadata nodes, and one directory may be spread over multiple servers, each server holding one part of the dir; how does ceph resolve this problem? This also causes a clock problem. It's not easy, but again, look at how the SnapRealms are dealt with. The MDSes will do synchronous notifications to each other. -Greg
Re: rbd map command hangs for 15 minutes during system start up
Making 'mon clock drift allowed' very small (0.1) does not reliably reproduce the hang. I started looking at the code for 0.48.2 and it looks like this is only used in Paxos::warn_on_future_time, which only handles the warning, nothing else. On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil s...@inktank.com wrote: On Fri, 16 Nov 2012, Nick Bartos wrote: Should I be lowering the clock drift allowed, or the lease interval to help reproduce it? clock drift allowed. On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil s...@inktank.com wrote: You can safely set the clock drift allowed as high as 500ms. The real limitation is that it needs to be well under the lease interval, which is currently 5 seconds by default. You might be able to reproduce more easily by lowering the threshold... sage On Fri, 16 Nov 2012, Nick Bartos wrote: How far off do the clocks need to be before there is a problem? It would seem to be hard to ensure a very large cluster has all of it's nodes synchronized within 50ms (which seems to be the default for mon clock drift allowed). Does the mon clock drift allowed parameter change anything other than the log messages? Are there any other tuning options that may help, assuming that this is the issue and it's not feasible to get the clocks more than 500ms in sync between all nodes? I'm trying to get a good way of reproducing this and get a trace on the ceph processes to see what they're waiting on. I'll let you know when I have more info. On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil s...@inktank.com wrote: I just realized I was mixing up this thread with the other deadlock thread. On Fri, 16 Nov 2012, Nick Bartos wrote: Turns out we're having the 'rbd map' hang on startup again, after we started using the wip-3.5 patch set. How critical is the libceph_protect_ceph_con_open_with_mutex commit? 
That's the one I removed before which seemed to get rid of the problem (although I'm not completely sure if it completely got rid of it, at least seemed to happen much less often). It seems like we only started having this issue after we started patching the 3.5 ceph client (we started patching to try and get rid of a kernel oops, which the patches seem to have fixed). Right. That patch fixes a real bug. It also seems pretty unlikely that this patch is related to the startup hang. The original log showed clock drift on the monitor that could very easily cause this sort of hang. Can you confirm that that isn't the case with this recent instance of the problem? And/or attach a log? Thanks- sage On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil s...@inktank.com wrote: On Thu, 15 Nov 2012, Nick Bartos wrote: Sorry I guess this e-mail got missed. I believe those patches came from the ceph/linux-3.5.5-ceph branch. I'm now using the wip-3.5 branch patches, which seem to all be fine. We'll stick with 3.5 and this backport for now until we can figure out what's wrong with 3.6. I typically ignore the wip branches just due to the naming when I'm looking for updates. Where should I typically look for updates that aren't in released kernels? Also, is there anything else in the wip* branches that you think we may find particularly useful? You were looking in the right place. The problem was we weren't super organized with our stable patches, and changed our minds about what to send upstream. These are 'wip' in the sense that they were in preparation for going upstream. The goal is to push them to the mainline stable kernels and ideally not keep them in our tree at all. wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but we're keeping it so that ubuntu can pick it up for quantal. I'll make sure these are more clearly marked as stable. 
sage On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil s...@inktank.com wrote: On Mon, 12 Nov 2012, Nick Bartos wrote: After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it seems we no longer have this hang. Hmm, that's a bit disconcerting. Did this series come from our old 3.5 stable series? I recently prepared a new one that backports *all* of the fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git. I would be curious if you see problems with that. So far, with these fixes in place, we have not seen any unexplained kernel crashes in this code. I take it you're going back to a 3.5 kernel because you weren't able to get rid of the sync problem with 3.6? sage On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin josh.dur...@inktank.com wrote: On 11/08/2012 02:10 PM, Mandell Degerness wrote: We are seeing a somewhat
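For reference, the two monitor settings this thread keeps coming back to live in ceph.conf (values shown are the defaults quoted above; the fragment is illustrative):

```ini
[mon]
    ; warn when peer clocks differ by more than this many seconds
    mon clock drift allowed = 0.05
    ; paxos lease length in seconds; clock drift must stay well under this
    mon lease = 5
```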
Re: librbd discard bug problems - i got it
mhm, the qemu rbd block driver always gets these errors back. As rbd_aio_bh_cb is directly called from librbd, the problem must be there. Strangely I can't find where rbd_aio_bh_cb gets called with -512. Any further ideas? rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -512 Error: 0 rbd_aio_bh_cb got error back. Code: -1006628352 Error: 0 Stefan
Re: osd recovery extremely slow with current master
Which version was this on? There was some fairly significant work done on recovery to introduce a reservation scheme and some other stuff that might need different defaults. -Greg On Tue, Nov 13, 2012 at 12:33 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi list, osd recovery seems to be really slow with current master. I see only 1-8 active+recovering out of 1200, even though there's no load on the ceph cluster. Greets, Stefan
objectcacher lru eviction causes assert
Hi All, We've been fixing a number of objectcacher bugs to handle races between slow osd commit replies and various other operations like truncate. I ran into another problem earlier today with a race between an object getting evicted from the lru cache (via readx -> trim) and the osd commit reply. The assertion trace is below. We've avoided keeping a reference to the object during the commit, but that means that the object isn't pinned in the lru, and so can come up for eviction. When it gets evicted, we close the object and hit the assertion; we can't close it, because we need the object to finish the commit. I've pushed a change that needs review in the wip-3431 branch. It allows the object to be evicted from the lru cache, but checks that it can be closed (as we do elsewhere) - and if not, lets the commit handle the close (via flush...release). The assertion we hit is: 2012-11-19 09:06:35.187910 7ff143e2f780 1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::close_object(ObjectCacher::Object*)' thread 7ff143e2f780 time 2012-11-19 09:06:35.186379 osdc/ObjectCacher.cc: 577: FAILED assert(ob->can_close()) ceph version 0.54-641-g4c69f86 (4c69f865ca79328c62635ae32c91bd32b3985613) 1: (ObjectCacher::close_object(ObjectCacher::Object*)+0x135) [0x5c78d5] 2: (ObjectCacher::trim(long, long)+0x820) [0x5c94d0] 3: (ObjectCacher::_readx(ObjectCacher::OSDRead*, ObjectCacher::ObjectSet*, Context*, bool)+0x21ad) [0x5d92dd] 4: (Client::_read_async(Fh*, unsigned long, unsigned long, ceph::buffer::list*)+0x3e9) [0x486c09] 5: (Client::_read(Fh*, long, unsigned long, ceph::buffer::list*)+0x265) [0x49bd65] 6: (Client::ll_read(Fh*, long, long, ceph::buffer::list*)+0x97) [0x49be87] 7: /tmp/cephtest/binary/usr/local/bin/ceph-fuse() [0x4733cf] 8: (()+0x12d5e) [0x7ff1439fdd5e] 9: (fuse_session_loop()+0x75) [0x7ff1439fbd65] 10: (ceph_fuse_ll_main(Client*, int, char const**, int)+0x225) [0x474245] 11: (main()+0x42f) [0x4716ef] 12: (__libc_start_main()+0xed) [0x7ff141ebd76d] 13: 
/tmp/cephtest/binary/usr/local/bin/ceph-fuse() [0x472e95] -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Removed directory is back in the Ceph FS
On Tue, Nov 13, 2012 at 3:23 AM, Franck Marchand fmarch...@agaetis.fr wrote: Hi, I have a weird problem. I removed a folder using a mounted fs partition. I did it and it worked well. What client are you using? How did you delete it? (rm -rf, etc?) Are you using multiple clients or one, and did you check it on a different client? I checked later to see if I had all my folders in ceph fs ...: the folder I removed was back and I can't remove it! Here is the error message I got: rm -rf 2012-11-10/ rm: cannot remove `2012-11-10': Directory not empty This folder is empty ... Has anybody had the same problem? Am I doing something wrong? This sounds like a known but undiagnosed problem with the MDS rstats. The part where your client reported success is a new wrinkle, though. -Greg Thx
Re: librbd discard bug problems - i got it
On 11/19/2012 03:16 PM, Stefan Priebe wrote: mhm qemu rbd block driver. It always gets these errors back. As rbd_aio_bh_cb is directly called from librbd the problem must be there. Strangely I can't find where rbd_aio_bh_cb gets called with -512. Any further ideas? Two ideas: 1) Is rbd_finish_aiocb getting this same return value? 2) Perhaps it's an issue with the return value wrapping around with very large discards. Adding some logging of the return values of each rados operation in AioCompletion::complete_request() might give us a clue. These large negative return values are suspicious.
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -512 Error: 0
rbd_aio_bh_cb got error back. Code: -1006628352 Error: 0
Stefan
Re: ceph-osd crashing (os/FileStore.cc: 4500: FAILED assert(replaying))
On 20.11.2012 00:39, Samuel Just wrote: Seems to be a truncated log file... That usually indicates filesystem corruption. Anything in dmesg? -Sam No. Everything fine. On Thu, Nov 15, 2012 at 1:07 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hello list, current master incl. upstream/wip-fd-simple-cache results in this crash when I try to start some of my osds (others work fine) today on multiple nodes:
-2 2012-11-15 22:04:09.226945 7f3af1c7a780 0 osd.52 pg_epoch: 657 pg[3.3b( v 632'823 (632'823,632'823] n=5 ec=17 les/c 18/18 656/656/17) [] r=0 lpr=0 pi=17-655/2 (info mismatch, log(632'823,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] Got exception 'read_log_error: read_log got 0 bytes, expected 126086-0=126086' while reading log. Moving corrupted log file to 'corrupt_log_2012-11-15_22:04_3.3b' for later analysis.
-1 2012-11-15 22:04:09.233563 7f3af1c7a780 0 osd.52 pg_epoch: 657 pg[3.557( v 632'753 (0'0,632'753] n=2 ec=17 les/c 18/18 656/656/17) [] r=0 lpr=0 pi=17-655/2 (info mismatch, log(0'0,0'0]) lcod 0'0 mlcod 0'0 inactive] Got exception 'read_log_error: read_log got 0 bytes, expected 115488-0=115488' while reading log. Moving corrupted log file to 'corrupt_log_2012-11-15_22:04_3.557' for later analysis.
0 2012-11-15 22:04:09.234536 7f3ae87d0700 -1 os/FileStore.cc: In function 'int FileStore::_collection_add(coll_t, coll_t, const hobject_t&, const SequencerPosition&)' thread 7f3ae87d0700 time 2012-11-15 22:04:09.233672
os/FileStore.cc: 4500: FAILED assert(replaying)
ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
1: (FileStore::_collection_add(coll_t, coll_t, hobject_t const&, SequencerPosition const&)+0x77d) [0x72ff0d]
2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x25fb) [0x73481b]
3: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x73952c]
4: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
6: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
7: (()+0x68ca) [0x7f3af16578ca]
8: (clone()+0x6d) [0x7f3aefac6bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels --- 0/ 5 none 0/ 0 lockdep 0/ 0 context 0/ 0 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 0 buffer 0/ 0 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 0 journaler 0/ 5 objectcacher 0/ 5 client 0/ 0 osd 0/ 0 optracker 0/ 0 objclass 0/ 0 filestore 0/ 0 journal 0/ 0 ms 1/ 5 mon 0/ 0 monc 0/ 5 paxos 0/ 0 tp 0/ 0 auth 1/ 5 crypto 0/ 0 finisher 0/ 0 heartbeatmap 0/ 0 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 0/ 0 asok 0/ 0 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 100 log_file /var/log/ceph/ceph-osd.52.log --- end dump of recent events ---
2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal (Aborted) ** in thread 7f3ae87d0700
ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
1: /usr/bin/ceph-osd() [0x799769]
2: (()+0xeff0) [0x7f3af165fff0]
3: (gsignal()+0x35) [0x7f3aefa29215]
4: (abort()+0x180) [0x7f3aefa2c020]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5]
6: (()+0xcb166) [0x7f3af02bc166]
7: (()+0xcb193) [0x7f3af02bc193]
8: (()+0xcb28e) [0x7f3af02bc28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x7fd069]
10: (FileStore::_collection_add(coll_t, coll_t, hobject_t const&, SequencerPosition const&)+0x77d) [0x72ff0d]
11: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x25fb) [0x73481b]
12: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x73952c]
13: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
15: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
16: (()+0x68ca) [0x7f3af16578ca]
17: (clone()+0x6d) [0x7f3aefac6bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events --- 0 2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal (Aborted) ** in thread 7f3ae87d0700 ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039) 1: /usr/bin/ceph-osd() [0x799769] 2: (()+0xeff0) [0x7f3af165fff0] 3: (gsignal()+0x35) [0x7f3aefa29215] 4: (abort()+0x180) [0x7f3aefa2c020] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5] 6: (()+0xcb166) [0x7f3af02bc166] 7: (()+0xcb193)
Re: librbd discard bug problems - i got it
On 20.11.2012 00:33, Josh Durgin wrote: On 11/19/2012 03:16 PM, Stefan Priebe wrote: mhm qemu rbd block driver. It always gets these errors back. As rbd_aio_bh_cb is directly called from librbd the problem must be there. Strangely I can't find where rbd_aio_bh_cb gets called with -512. Any further ideas? Two ideas: 1) Is rbd_finish_aiocb getting this same return value? Will check this tomorrow. 2) Perhaps it's an issue with the return value wrapping around with very large discards. Adding some logging of the return values of each rados operation in AioCompletion::complete_request() might give us a clue. These large negative return values are suspicious. Good idea. As r and rval are int, they are limited. But AioCompletion::complete_request is adding more and more stuff to rval. What could be a solution? Bump rval to int64? Or wrap around to start at 0 again? Stefan
[PATCH, v2] rbd: do not allow remove of mounted-on image
There is no check in rbd_remove() to see if anybody holds open the image being removed. That's not cool. Add a simple open count that goes up and down with opens and closes (releases) of the device, and don't allow an rbd image to be removed if the count is non-zero. Protect the updates of the open count value with ctl_mutex to ensure the underlying rbd device doesn't get removed while concurrently being opened. Signed-off-by: Alex Elder el...@inktank.com
---
v2: added ctl_mutex locking for rbd_open() and rbd_release()
 drivers/block/rbd.c | 13 +
 1 file changed, 13 insertions(+)
Index: b/drivers/block/rbd.c
===================================================================
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -255,6 +255,7 @@ struct rbd_device {
 	/* sysfs related */
 	struct device dev;
+	unsigned long open_count;
 };
 static DEFINE_MUTEX(ctl_mutex);	 /* Serialize open/close/setup/teardown */
@@ -356,8 +357,11 @@ static int rbd_open(struct block_device
 	if ((mode & FMODE_WRITE) && rbd_dev->mapping.read_only)
 		return -EROFS;
+	mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
 	rbd_get_dev(rbd_dev);
 	set_device_ro(bdev, rbd_dev->mapping.read_only);
+	rbd_dev->open_count++;
+	mutex_unlock(&ctl_mutex);
 	return 0;
 }
@@ -366,7 +370,11 @@ static int rbd_release(struct gendisk *d
 	struct rbd_device *rbd_dev = disk->private_data;
+	mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
+	rbd_assert(rbd_dev->open_count > 0);
+	rbd_dev->open_count--;
 	rbd_put_dev(rbd_dev);
+	mutex_unlock(&ctl_mutex);
 	return 0;
 }
@@ -3764,6 +3772,11 @@ static ssize_t rbd_remove(struct bus_typ
 		goto done;
 	}
+	if (rbd_dev->open_count) {
+		ret = -EBUSY;
+		goto done;
+	}
+
 	rbd_remove_all_snaps(rbd_dev);
 	rbd_bus_del_dev(rbd_dev);
Re: ceph-osd crashing (os/FileStore.cc: 4500: FAILED assert(replaying))
Can you restart one of the affected osds with debug osd = 20, debug filestore = 20, debug ms = 1 and post the log? -Sam On Mon, Nov 19, 2012 at 3:39 PM, Stefan Priebe s.pri...@profihost.ag wrote: On 20.11.2012 00:39, Samuel Just wrote: Seems to be a truncated log file... That usually indicates filesystem corruption. Anything in dmesg? -Sam No. Everything fine.
Re: ceph-osd crashing (os/FileStore.cc: 4500: FAILED assert(replaying))
I've formatted the cluster since then. But I'll report back if this happens again. Stefan On 20.11.2012 00:43, Samuel Just wrote: Can you restart one of the affected osds with debug osd = 20, debug filestore = 20, debug ms = 1 and post the log? -Sam On Mon, Nov 19, 2012 at 3:39 PM, Stefan Priebe s.pri...@profihost.ag wrote: On 20.11.2012 00:39, Samuel Just wrote: Seems to be a truncated log file... That usually indicates filesystem corruption. Anything in dmesg? -Sam No. Everything fine.
Re: librbd discard bug problems - i got it
On 11/19/2012 03:42 PM, Stefan Priebe wrote: On 20.11.2012 00:33, Josh Durgin wrote: On 11/19/2012 03:16 PM, Stefan Priebe wrote: mhm qemu rbd block driver. It always gets these errors back. As rbd_aio_bh_cb is directly called from librbd the problem must be there. Strangely I can't find where rbd_aio_bh_cb gets called with -512. Any further ideas? Two ideas: 1) Is rbd_finish_aiocb getting this same return value? Will check this tomorrow. 2) Perhaps it's an issue with the return value wrapping around with very large discards. Adding some logging of the return values of each rados operation in AioCompletion::complete_request() might give us a clue. These large negative return values are suspicious. Good idea. As r and rval are int, they are limited. But AioCompletion::complete_request is adding more and more stuff to rval. What could be a solution? Bump rval to int64? Or wrap around to start at 0 again? The final return value is limited to int at a few levels. Probably it's best to make discard always return 0 on success. aio_discard should already be doing this, but perhaps it's not in this case. Josh
Re: librbd discard bug problems - i got it
Hi Josh, I don't get it. Every debug line I print is a positive fine value. But rbd_aio_bh_cb gets called with these values. As those are not many values, I copied all values != 0 from the log for discarding a whole 30GB device. Stefan On 20.11.2012 00:47, Josh Durgin wrote: On 11/19/2012 03:42 PM, Stefan Priebe wrote: On 20.11.2012 00:33, Josh Durgin wrote: On 11/19/2012 03:16 PM, Stefan Priebe wrote: mhm qemu rbd block driver. It always gets these errors back. As rbd_aio_bh_cb is directly called from librbd the problem must be there. Strangely I can't find where rbd_aio_bh_cb gets called with -512. Any further ideas? Two ideas: 1) Is rbd_finish_aiocb getting this same return value? Will check this tomorrow. 2) Perhaps it's an issue with the return value wrapping around with very large discards. Adding some logging of the return values of each rados operation in AioCompletion::complete_request() might give us a clue. These large negative return values are suspicious. Good idea. As r and rval are int, they are limited. But AioCompletion::complete_request is adding more and more stuff to rval. What could be a solution? Bump rval to int64? Or wrap around to start at 0 again? The final return value is limited to int at a few levels. Probably it's best to make discard always return 0 on success. aio_discard should already be doing this, but perhaps it's not in this case. Josh
Re: librbd discard bug problems - i got it
On 11/19/2012 04:00 PM, Stefan Priebe wrote: Hi Josh, I don't get it. Every debug line I print is a positive fine value. But rbd_aio_bh_cb gets called with these values. As those are not many values, I copied all values != 0 from the log for discarding a whole 30GB device. Could you post the patch of the debug prints you added and the log?
Re: libcephfs create file with layout and replication
On Sun, Nov 18, 2012 at 12:05 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: Wanna have a look at a first pass on this patch? wip-client-open-layout Thanks, Noah Just glanced over this, and I'm curious: 1) Why symlink another reference to your file_layout.h? 2) There's already a ceph_file_layout struct which is used widely (MDS, kernel, userspace client). It also has an accompanying function that does basic validity checks. On Sat, Nov 17, 2012 at 5:20 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: On Sat, Nov 17, 2012 at 4:15 PM, Sage Weil s...@inktank.com wrote: We ignore that for the purposes of getting the libcephfs API correct, though... Ok, makes sense. Thanks. Noah FYI, there's an unused __le32 in the open struct (used to be for preferred PG). We should be able to steal that away without too much pain or massaging! :) -Greg
Re: Can't start ceph mon
Also, if you still have it, could you zip up your monitor data directory and put it somewhere accessible to us? (I can provide you a drop point if necessary.) We'd like to look at the file layouts a bit since we thought we were properly handling ENOSPC-style issues. -Greg On Mon, Nov 19, 2012 at 1:45 PM, Gregory Farnum g...@inktank.com wrote: On Mon, Nov 19, 2012 at 1:08 PM, Dave Humphreys (Datatone) d...@datatone.co.uk wrote: I have a problem in which I can't start my ceph monitor. The log is shown below. The log shows version 0.54. I was running 0.52 when the problem arose, and I moved to the latest in case the newer version fixed the problem. The original failure happened a week or so ago, and could have been as a result of running out of disk space when the ceph monitor log became huge. That is almost certainly the case, although I thought we were handling this issue better now. What should I do to recover the situation? Do you have other monitors in working order? The easiest way to handle it if that's the case is just to remove this monitor from the cluster and add it back in as a new monitor with a fresh store. If not we can look into reconstructing it. 
-Greg David 2012-11-19 20:38:51.598468 7fc13fdc6780 0 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150), process ceph-mon, pid 21012 2012-11-19 20:38:51.598482 7fc13fdc6780 1 store(/ceph/mon.vault01) mount 2012-11-19 20:38:51.598527 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 21 2012-11-19 20:38:51.598542 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl magic = 21 bytes 2012-11-19 20:38:51.598562 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.598567 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.598582 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 205 2012-11-19 20:38:51.598586 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl monmap/latest = 205 bytes 2012-11-19 20:38:51.598809 7fc13fdc6780 1 -- 10.0.1.1:6789/0 learned my addr 10.0.1.1:6789/0 2012-11-19 20:38:51.598818 7fc13fdc6780 1 accepter.accepter.bind my_inst.addr is 10.0.1.1:6789/0 need_addr=0 2012-11-19 20:38:51.599498 7fc13fdc6780 1 -- 10.0.1.1:6789/0 messenger.start 2012-11-19 20:38:51.599508 7fc13fdc6780 1 accepter.accepter.start 2012-11-19 20:38:51.599610 7fc13fdc6780 1 mon.vault01@-1(probing) e1 init fsid 4d7d8d20-338c-4bdc-9918-9bcf04f9a832 2012-11-19 20:38:51.599674 7fc13cdbe700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c6c0 sd=14 :6789 pgs=0 cs=0 l=0).accept sd=14 2012-11-19 20:38:51.599678 7fc141eff700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c240 sd=9 :6789 pgs=0 cs=0 l=0).accept sd=9 2012-11-19 20:38:51.599718 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 37 2012-11-19 20:38:51.599723 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl cluster_uuid = 37 bytes 2012-11-19 20:38:51.599718 7fc13ccbd700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213c480 sd=19 :6789 pgs=0 cs=0 l=0).accept sd=19 2012-11-19 20:38:51.599729 7fc13fdc6780 10 mon.vault01@-1(probing) e1 check_fsid cluster_uuid contains '4d7d8d20-338c-4bdc-9918-9bcf04f9a832' 2012-11-19 20:38:51.599739 7fc13fdc6780 20 
store(/ceph/mon.vault01) reading at off 0 of 75 2012-11-19 20:38:51.599745 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl feature_set = 75 bytes 2012-11-19 20:38:51.599751 7fc13fdc6780 10 mon.vault01@-1(probing) e1 features compat={},rocompat={},incompat={1=initial feature set (~v.18)} 2012-11-19 20:38:51.599759 7fc13fdc6780 15 store(/ceph/mon.vault01) exists_bl joined 2012-11-19 20:38:51.599769 7fc13fdc6780 10 mon.vault01@-1(probing) e1 has_ever_joined = 1 2012-11-19 20:38:51.599794 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/last_committed = 13 2012-11-19 20:38:51.599801 7fc13fdc6780 15 store(/ceph/mon.vault01) get_int pgmap/first_committed = 132833 2012-11-19 20:38:51.599810 7fc13fdc6780 20 store(/ceph/mon.vault01) reading at off 0 of 239840 2012-11-19 20:38:51.599928 7fc13cbbc700 1 -- 10.0.1.1:6789/0 :/0 pipe(0x213cd80 sd=20 :6789 pgs=0 cs=0 l=0).accept sd=20 2012-11-19 20:38:51.599950 7fc13fdc6780 15 store(/ceph/mon.vault01) get_bl pgmap/latest = 239840 bytes
--- begin dump of recent events ---
2012-11-19 20:38:51.600509 7fc13fdc6780 -1 *** Caught signal (Aborted) ** in thread 7fc13fdc6780
ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
1: ceph-mon() [0x53adf8]
2: (()+0xfe90) [0x7fc141830e90]
3: (gsignal()+0x3e) [0x7fc140016dae]
4: (abort()+0x17b) [0x7fc14001825b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc141af300d]
6: (()+0xb31b6) [0x7fc141af11b6]
7: (()+0xb31e3) [0x7fc141af11e3]
8: (()+0xb32de) [0x7fc141af12de]
9: ceph-mon() [0x5ecb9f]
10: (Paxos::get_stashed(ceph::buffer::list&)+0x1ed) [0x49e28d]
11: (Paxos::init()+0x109) [0x49e609]
12: (Monitor::init()+0x36a) [0x485a4a]
13:
Request to join mailing group
Re: libcephfs create file with layout and replication
On Mon, Nov 19, 2012 at 5:04 PM, Gregory Farnum g...@inktank.com wrote: Just glanced over this, and I'm curious: 1) Why symlink another reference to your file_layout.h? I followed the same pattern as page.h in librados, but may have misunderstood its use. When libcephfs.h is installed, it includes #include <file_layout.h> and we assume the user has -I<prefix>/cephfs/. But in the build tree, include/cephfs isn't an include path that's used, hence the symlink. 2) There's already a ceph_file_layout struct which is used widely (MDS, kernel, userspace client). It also has an accompanying function that does basic validity checks. I avoided ceph_file_layout because I was under the impression that all of the __le64 stuff in it was very much Linux-specific. I had run into a lot of this hacking on an OSX port. FYI, there's an unused __le32 in the open struct (used to be for preferred PG). We should be able to steal that away without too much pain or massaging! :) Nice. Do you think I should revert back to using ceph_file_layout? Thanks, Noah
Re: libcephfs create file with layout and replication
On Mon, 19 Nov 2012, Noah Watkins wrote: On Mon, Nov 19, 2012 at 5:04 PM, Gregory Farnum g...@inktank.com wrote: Just glanced over this, and I'm curious: 1) Why symlink another reference to your file_layout.h? I followed the same pattern as page.h in librados, but may have misunderstood its use. When libcephfs.h is installed, it includes #include <file_layout.h> and we assume the user has -I<prefix>/cephfs/. But in the build tree, include/cephfs isn't an include path that's used, hence the symlink. 2) There's already a ceph_file_layout struct which is used widely (MDS, kernel, userspace client). It also has an accompanying function that does basic validity checks. I avoided ceph_file_layout because I was under the impression that all of the __le64 stuff in it was very much Linux-specific. I had run into a lot of this hacking on an OSX port. FYI, there's an unused __le32 in the open struct (used to be for preferred PG). We should be able to steal that away without too much pain or massaging! :) Nice. Do you think I should revert back to using ceph_file_layout? We could avoid the whole issue by passing 4 arguments to the function...
Re: Remote Ceph Install
On 11/19/2012 11:42 AM, Blackwell, Edward wrote:
> Hi, I work for Harris Corporation, and we are investigating Ceph as a
> potential solution to a storage problem that one of our government
> customers is currently having. I've already created a two-node cluster
> on a couple of VMs with another VM acting as an administrative client.
> The cluster was created using some installation instructions supplied
> to us via Inktank, and through the use of the ceph-deploy script. Aside
> from a couple of quirky discrepancies between the installation
> instructions and my environment, everything went well. My issue has
> cropped up on the second cluster I'm trying to create, which is using a
> VM and a non-VM server for the nodes in the cluster. Eventually, both
> nodes in this cluster will be non-VMs, but we're still waiting on the
> hardware for the second node, so I'm using a VM in the meantime just to
> get this second cluster up and going. Of course, the administrative
> client node is still a VM.

Hi Ed. Welcome.

> The problem that I'm having with this second cluster concerns the
> non-VM server (elsceph01 for the sake of the commands mentioned from
> here on out). In particular, the issue crops up with the ceph-deploy
> install elsceph01 command I'm executing on my client VM (cephclient01)
> to install Ceph on the non-VM server. The installation doesn't appear
> to be working as the command does not return the OK message that it
> should when it completes successfully. I've tried using the verbose
> option on the command to see if that sheds any light on the subject,
> but alas, it does not:
>
> root@cephclient01:~/my-admin-sandbox# ceph-deploy -v install elsceph01
> DEBUG:ceph_deploy.install:Installing stable version argonaut on cluster ceph hosts elsceph01
> DEBUG:ceph_deploy.install:Detecting platform for host elsceph01 ...
> DEBUG:ceph_deploy.install:Installing for Ubuntu 12.04 on host elsceph01 ...
> root@cephclient01:~/my-admin-sandbox#
>
> Would you happen to have a breakdown of the commands being executed by
> the ceph-deploy script behind the scenes so I can maybe execute them
> one-by-one to see where the error is? I have confirmed that it looks
> like the installation of the software has succeeded as I did a which
> ceph command on elsceph01, and it reported back /usr/bin/ceph. Also,
> /etc/ceph/ceph.conf is there, and it matches the file created by the
> ceph-deploy new ... command on the client. Does the install command do
> a mkcephfs behind the scenes? The reason I ask is that when I do the
> ceph-deploy mon command from the client, which is the next command
> listed in the instructions, I get this output:

Basically install just runs the appropriate Debian package commands to get the requested release of Ceph installed on the target host (in this case, defaulting to argonaut). The command normally doesn't issue any output.

> root@cephclient01:~/my-admin-sandbox# ceph-deploy mon
> creating /var/lib/ceph/tmp/ceph-ELSCEPH01.mon.keyring

This looks like there may be confusion about case in the hostname. What does hostname on elsceph01 report? If it's ELSCEPH01, that's probably the problem; the pathnames etc. are all case-sensitive. Could it be that /etc/hosts has the wrong case, or both cases, of the hostname in it?

> 2012-11-15 11:35:38.954261 7f7a6c274780 -1 asok(0x260b000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.ELSCEPH01.asok': (2) No such file or directory
> Traceback (most recent call last):
>   File "/usr/local/bin/ceph-deploy", line 9, in <module>
>     load_entry_point('ceph-deploy==0.0.1', 'console_scripts', 'ceph-deploy')()
>   File "/root/ceph-deploy/ceph_deploy/cli.py", line 80, in main
> added entity mon.
> auth auth(auid = 18446744073709551615 key=AQBWDj5QAP6LHhAAskVBnUkYHJ7eYREmKo5qKA== with 0 caps)
>     return args.func(args)
> mon/MonMap.h: In function 'void MonMap::add(const string, const entity_addr_t)' thread 7f7a6c274780 time 2012-11-15 11:35:38.955024
> mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0)
> ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
> 1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8]
> 2: (MonMap::build_initial(CephContext*, std::ostream)+0x113) [0x59bd53]
> 3: (main()+0x12bb) [0x45ffab]
> 4: (__libc_start_main()+0xed) [0x7f7a6a6d776d]
> 5: ceph-mon() [0x462a19]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2012-11-15 11:35:38.955924 7f7a6c274780 -1 mon/MonMap.h: In function 'void MonMap::add(const string, const entity_addr_t)' thread 7f7a6c274780 time 2012-11-15 11:35:38.955024
> mon/MonMap.h: 97: FAILED assert(addr_name.count(addr) == 0)
> ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
> 1: (MonMap::build_from_host_list(std::string, std::string)+0x738) [0x5988b8]
> 2: (MonMap::build_initial(CephContext*, std::ostream)+0x113) [0x59bd53]
> 3: (main()+0x12bb) [0x45ffab]
> 4:
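[For readers following along: the hostname-case theory fits the assert in the output above. MonMap::add fails when the same monitor address is added twice, which is exactly what happens if /etc/hosts lists the host under both cases of the name. A toy Python model of that check, with an illustrative address; the real MonMap is C++ in mon/MonMap.h:]

```python
# Toy model of the failing check in mon/MonMap.h: adding the same
# monitor address twice trips FAILED assert(addr_name.count(addr) == 0).
# Illustrative only; address and names below are made up.

def build_monmap(host_addr_pairs):
    addr_name = {}  # addr -> name, mirroring MonMap's reverse-lookup map
    for name, addr in host_addr_pairs:
        # equivalent of: assert(addr_name.count(addr) == 0)
        if addr in addr_name:
            raise AssertionError("address %s already added as %s"
                                 % (addr, addr_name[addr]))
        addr_name[addr] = name
    return addr_name

# One hostname, one address: fine.
build_monmap([("elsceph01", "10.10.0.5:6789")])

# An /etc/hosts carrying both cases of the name for the same address
# reproduces the crash:
try:
    build_monmap([("ELSCEPH01", "10.10.0.5:6789"),
                  ("elsceph01", "10.10.0.5:6789")])
except AssertionError:
    pass  # this is the FAILED assert seen in the backtrace
```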
Re: RBD fio Performance concerns
> Which iodepth did you use for those benchs?

iodepth = 100
filesize = 1G, 10G, 30G, same result

(3 nodes, 8 cores 2.5GHz, 32GB RAM, with 6 OSDs each (15k drives) + journal on tmpfs)

Note that I can't get more than 6000 iops on a single rbd device, but with more devices it scales (each fio is at 6000 iops). (I have the same result with the rbd module or with a kvm guest.)

----- Original message -----
From: Sébastien Han han.sebast...@gmail.com
To: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel ceph-devel@vger.kernel.org, Mark Kampe mark.ka...@inktank.com
Sent: Monday, 19 November 2012 21:57:59
Subject: Re: RBD fio Performance concerns

Which iodepth did you use for those benchs?

> I really don't understand why I can't get more rand read iops with 4K block ...

Me neither, hope to get some clarification from the Inktank guys. It doesn't make any sense to me...

--
Bien cordialement.
Sébastien HAN.

On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
>> @Alexandre: is it the same for you? or do you always get more IOPS with seq?
>
> rand read 4K : 6000 iops
> seq read 4K  : 3500 iops
> seq read 4M  : 31 iops (1 gigabit client bandwidth limit)
> rand write 4K: 6000 iops (tmpfs journal)
> seq write 4K : 1600 iops
> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>
> I really don't understand why I can't get more rand read iops with 4K
> block ... I tried with a high-end cpu for the client; it doesn't change
> anything. But the test cluster uses old 8-core E5420 @ 2.50GHz machines
> (cpu is around 15% on the cluster during the read bench).

----- Original message -----
From: Sébastien Han han.sebast...@gmail.com
To: Mark Kampe mark.ka...@inktank.com
Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, 19 November 2012 19:03:40
Subject: Re: RBD fio Performance concerns

@Sage, thanks for the info :)

@Mark:
> If you want to do sequential I/O, you should do it buffered (so that the
> writes can be aggregated) or with a 4M block size (very efficient and
> avoiding object serialization).

The original benchmark was performed with a 4M block size, and as you can see I still get more IOPS with rand than seq... I just tried with 4M without direct I/O, still the same. I can print fio results if needed.

> We do direct writes for benchmarking, not because it is a reasonable way
> to do I/O, but because it bypasses the buffer cache and enables us to
> directly measure cluster I/O throughput (which is what we are trying to
> optimize). Applications should usually do buffered I/O, to get the (very
> significant) benefits of caching and write aggregation.

I know why I use direct I/O. It's a synthetic benchmark, far away from a real-life scenario and from how common applications work. I just try to see the maximum I/O throughput that I can get from my RBD. All my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS with seq?

Thanks to all of you.

On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com wrote:
> Recall:
>   1. RBD volumes are striped (4M wide) across RADOS objects
>   2. distinct writes to a single RADOS object are serialized
>
> Your sequential 4K writes are direct, depth=256, so there are (at all
> times) 256 writes queued to the same object. All of your writes are
> waiting through a very long line, which is adding horrendous latency.
>
> If you want to do sequential I/O, you should do it buffered (so that the
> writes can be aggregated) or with a 4M block size (very efficient and
> avoiding object serialization).
>
> We do direct writes for benchmarking, not because it is a reasonable way
> to do I/O, but because it bypasses the buffer cache and enables us to
> directly measure cluster I/O throughput (which is what we are trying to
> optimize). Applications should usually do buffered I/O, to get the (very
> significant) benefits of caching and write aggregation.

That's correct for some of the benchmarks. However even with 4K for seq, I still get less IOPS. See below my last fio:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
    slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
    clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
     lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
    bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
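[For readers following along: Mark's two recalls (4M-wide striping, per-object write serialization) and the 31 iops figure at 4M can be checked with back-of-the-envelope arithmetic. The sketch below is pure illustration, no ceph code; the 4M object size comes from Mark's note, the 10G image size is an assumed example.]

```python
# Back-of-the-envelope model: which RADOS object a byte offset of an
# RBD image falls into, and why deep sequential 4K direct I/O queues
# behind a single object. Illustrative only.
import random

OBJECT_SIZE = 4 << 20  # 4 MB stripe width, per Mark's note
BLOCK = 4 << 10        # 4 KB benchmark block size

def object_index(offset, object_size=OBJECT_SIZE):
    """Index of the RADOS object backing this byte offset."""
    return offset // object_size

# 256 queued *sequential* 4K writes all land in the same 4M object,
# so per-object serialization makes them wait in one long line:
seq = {object_index(i * BLOCK) for i in range(256)}

# 256 *random* 4K writes over an assumed 10 GB image mostly hit
# distinct objects, so they can proceed in parallel:
random.seed(0)
rand = {object_index(random.randrange(0, 10 << 30)) for _ in range(256)}

# Alexandre's 4M figures: ~31 iops * 4 MB * 8 bits is roughly 1.04e9
# bits/s, i.e. the 1-gigabit client NIC is the bottleneck, not ceph:
throughput_bits = 31 * (4 << 20) * 8
```

This matches the thread: random 4K I/O spreads across objects and scales, sequential 4K direct I/O serializes on one object, and 4M transfers saturate the client's gigabit link.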