[ceph-users] how to resolve : start mon assert == 0
Hello all, when I restart any mon in the mon cluster {mon.a, mon.b, mon.c} after killing all mons (cephx disabled), an exception occurred as follows:

# ceph-mon -i b
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fc801c78780 time 2014-10-20 15:29:31.966367
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
 2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
 4: (Monitor::init_paxos()+0xf5) [0x54a515]
 5: (Monitor::preinit()+0x69f) [0x56291f]
 6: (main()+0x2665) [0x534df5]
 7: (__libc_start_main()+0xed) [0x7fc7ffc7876d]
 8: ceph-mon() [0x537bf9]

Can anyone help to solve this problem?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Same rbd mount from multiple servers
Hello, I made a 2GB RBD on Ceph and mounted it on three separate servers. I followed this: http://ceph.com/docs/master/start/quick-rbd/ Setup, mkfs (ext4) and mount all finished successfully, but every node seems to see a different rbd volume. :-o If I copy a 100 MB file on the test1 node I don't see this file on the test2 and test3 nodes. I'm using Ubuntu 14.04 x64 with the latest stable ceph (0.80.7). What's wrong? Thank you, Mihaly http://www.virtual-call-center.eu/ http://www.virtual-call-center.hu/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Same rbd mount from multiple servers
Hi Sean, Thank you for your quick response! Okay, I see; is there any preferred clustered FS in this case? OCFS2, GFS? Thanks, Mihaly

2014-10-20 10:36 GMT+02:00 Sean Redmond sean.redm...@ukfast.co.uk:
Hi Mihaly, To my understanding you cannot mount an ext4 file system on more than one server at the same time; you would need to look at using a clustered file system. Thanks
*From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Mihály Árva-Tóth *Sent:* 20 October 2014 09:34 *To:* ceph-users@lists.ceph.com *Subject:* [ceph-users] Same rbd mount from multiple servers
Hello, I made a 2GB RBD on Ceph and mounted it on three separate servers. I followed this: http://ceph.com/docs/master/start/quick-rbd/ Setup, mkfs (ext4) and mount all finished successfully, but every node seems to see a different rbd volume. :-o If I copy a 100 MB file on the test1 node I don't see this file on the test2 and test3 nodes. I'm using Ubuntu 14.04 x64 with the latest stable ceph (0.80.7). What's wrong? Thank you, Mihaly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
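For a small shared RBD on Ubuntu, OCFS2 is usually the least setup work. A minimal sketch, assuming the image from the quick-start is test-pool/test-image, that it maps to /dev/rbd0 on every node, and that /etc/ocfs2/cluster.conf already lists the three nodes (all of these names are assumptions, not from the thread):

# apt-get install ocfs2-tools            (on every node)
# rbd map test-pool/test-image           (on every node)
# mkfs.ocfs2 -N 3 -L shared /dev/rbd0    (on one node only; -N = number of node slots)
# service o2cb online                    (on every node)
# mount -t ocfs2 /dev/rbd0 /mnt/shared   (on every node)

GFS2 works on the same principle but needs the corosync/dlm cluster stack, which is typically more effort to configure.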
Re: [ceph-users] how to resolve : start mon assert == 0
Please refer to http://tracker.ceph.com/issues/8851 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of minchen Sent: Monday, October 20, 2014 3:42 PM To: ceph-users; ceph-de...@vger.kernel.org Subject: [ceph-users] how to resolve : start mon assert == 0 Hello , all when i restart any mon in mon cluster{mon.a, mon.b, mon.c} after kill all mons(disabled cephx). An exception occured as follows: # ceph-mon -i b mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread thread 7fc801c78780 time 2014-10-20 15:29:31.966367 mon/AuthMonitor.cc: 155: FAILED assert(ret == 0) ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f) 1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6] 2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347] 4: (Monitor::init_paxos()+0xf5) [0x54a515] 5: (Monitor::preinit()+0x69f) [0x56291f] 6: (main()+0x2665) [0x534df5] 7: (__libc_start_main()+0xed) [0x7fc7ffc7876d] 8: ceph-mon() [0x537bf9] Anyone can help to solve this problem? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
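The tracker entry above has the details; if more information is needed to attach to it, one option (just a sketch, reusing the mon id from the original post) is to run the failing monitor in the foreground with verbose logging:

# ceph-mon -i b -d --debug_mon 20 --debug_paxos 20 --debug_ms 1

The output around the FAILED assert(ret == 0) line is usually what gets requested on the tracker.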
[ceph-users] How to calculate file size when mount a block device from rbd image
Hello all, I have a question about how to calculate file size when mounting a block device from an rbd image.
[Cluster information:]
1. The cluster has 1 mon and 6 osds. Every osd is 1T. Total space is 5556G.
2. rbd pool: replicated size 2, min_size 1, pg num = 128. Except for the rbd pool, all other pools are empty.
[Steps]
1. On a Linux client I use the rbd command to create a 1.5T rbd image and format it with ext4.
2. Use the dd command to create a 1.2T file: #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000
3. When dd finished, it reported No space left on device. But parted -l displays the disk size as 1611G.
Why does the system say there is not enough space? Is there something I misunderstand or did wrong? Best wishes, Mika
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] real beginner question
Hi list, This is a real newbie question (and hopefully the right list to ask!). Is it possible to set up ceph in an already virtualized environment? i.e. we have a scenario here where we have a virtual machine (as opposed to individual physical machines) with an Ubuntu OS on it. We are trying to create a ceph cluster on this virtual machine (not sure if this is a sensible thing to do!). In our effort to install ceph we used vagrant (we came across some notes through google). We thought that would be the easiest route, as we do not know anything yet. But we are unsuccessful. We can go as far as creating a virtual machine, but it fails at the provisioning stage (i.e. mons, osds, mdss, rgws etc. do not get created). Any suggestions? Thanks Ranju Upadhyay Maynooth University, Ireland. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to calculate file size when mount a block device from rbd image
On 10/20/2014 11:16 AM, Vickie CH wrote: Hello all, I have a question about how to calculate file size when mount a block device from rbd image . [Cluster information:] 1.The cluster with 1 mon and 6 osds. Every osd is 1T. Total spaces is 5556G. 2.rbd pool:replicated size 2 min_size 1. num = 128. Except rbd pool other pools is empty. [Steps] 1.On Linux client I use rbd command to create a 1.5T rbd image and format it with ext4. 2.Use dd command to create a 1.2T file. #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000 3.When dd finished the information shows No space left on device. But parted -l display the disk space is 1611G. Why does the system show space not enough? Is there something I misunderstand or wrong? Probably the rounding of GB and GiB. Keep in mind that 1.5TiB is 1.39TB and that ext4 also eats up space and reserves 5% for the superuser. Best wishes, Mika ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
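Two quick checks that make the difference visible (a sketch only; it assumes the image is mapped as /dev/rbd0 and mounted on /mnt/ceph-mount, which the original post does not spell out):

# df -h /mnt/ceph-mount
# tune2fs -l /dev/rbd0 | grep -i 'reserved block count'
# tune2fs -m 0 /dev/rbd0       (drops the 5% root reserve if the filesystem only holds bulk data)

Also note that parted reports decimal units: the 1611GB it shows is about 1500 GiB, and ext4 metadata plus the 5% reserve come out of that before df reports it as available.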
[ceph-users] slow requests - what is causing them?
Hello cephers, I've been testing flashcache and enhanceio block device caching for the osds and I've noticed I have started getting slow requests. The caching type that I use is read only, so all writes bypass the caching ssds and go directly to the osds, just like before introducing the caching layer. Prior to introducing caching, I rarely had slow requests. Judging by the logs, all slow requests look like these:

2014-10-16 01:09:15.600807 osd.7 192.168.168.200:6836/32031 100 : [WRN] slow request 30.999641 seconds old, received at 2014-10-16 01:08:44.601040: osd_op(client.36035566.0:16626375 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2007040~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9
2014-10-16 01:09:15.600811 osd.7 192.168.168.200:6836/32031 101 : [WRN] slow request 30.999581 seconds old, received at 2014-10-16 01:08:44.601100: osd_op(client.36035566.0:16626376 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2039808~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9
2014-10-16 01:09:16.185530 osd.2 192.168.168.200:6811/31891 76 : [WRN] 20 slow requests, 1 included below; oldest blocked for 57.003961 secs
2014-10-16 01:09:16.185564 osd.2 192.168.168.200:6811/31891 77 : [WRN] slow request 30.098574 seconds old, received at 2014-10-16 01:08:46.086854: osd_op(client.38917806.0:3481697 rbd_data.251d05e3db45a54. 0304 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2732032~8192] 5.e4683bbb ack+ondisk+write e61892) v4 currently waiting for subops from 11
2014-10-16 01:09:16.601020 osd.7 192.168.168.200:6836/32031 102 : [WRN] 16 slow requests, 2 included below; oldest blocked for 43.531516 secs

In general, I see between 0 and about 2,000 slow request log entries per day. On one day I saw over 100k entries, but it only happened once. I am struggling to understand what is causing the slow requests. If all the writes go the same path as before caching was introduced, how come I am getting them? How can I investigate this further? Thanks Andrei
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
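One generic way to dig into blocked and recently slow ops is the OSD admin socket (a sketch; the osd ids come from the log excerpts above and the socket path assumes the default location, which may differ on your systems):

# ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok dump_ops_in_flight
# ceph --admin-daemon /var/run/ceph/ceph-osd.9.asok dump_historic_ops
# ceph osd perf        (per-OSD commit/apply latency, if your release supports it)

The "currently waiting for subops from 9" / "from 11" parts of the warnings point at the replica OSDs, so those are the ones worth inspecting first.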
Re: [ceph-users] Reweight a host
I don't think so, check this out:

# id    weight    type name               up/down  reweight
-6      3.05      root ssd
-7      0.04999       host ceph-01-ssd
11      0.04999           osd.11          up       1
-8      1             host ceph-02-ssd
12      0.04999           osd.12          up       1
-9      1             host ceph-03-ssd
13      0.03999           osd.13          up       1
-10     1             host ceph-04-ssd
14      0.03999           osd.14          up       1

As you can see, only host ceph-01-ssd has the same weight as its osd; the other three hosts have weight 1, which is different from their associated osd. If the weight of the host -should- be the sum of all osd weights on the host, then my question becomes: how do I make that so for the three hosts where this is currently not the case? Thanks, Erik.

On 20-10-14 03:55, Lei Dong wrote: According to my understanding, the weight of a host is the sum of all osd weights on this host. So you just reweight any osd on this host, the weight of this host is reweighed. Thanks LeiDong On 10/20/14, 7:11 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Simple question: how do I reweight a host in crushmap? I can use ceph osd crush reweight to reweight an osd, but I would like to change the weight of a host instead. I tried exporting the crushmap, but I noticed that the weights of all hosts are commented out, like so: # weight 5.460 And they are not the same values as seen in ceph osd tree. So how do I keep everything as it currently is, but simply change one single weight of one single host? Thanks, Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Reweight a host
I've never seen this before. The weight of the host is commented out because its weight is the sum of the weights in the following lines starting with "item". Can you attach your crush map? Did you manually change it?

On 10/20/14, 6:06 PM, Erik Logtenberg e...@logtenberg.eu wrote: I don't think so, check this out: # id weight type name up/down reweight -6 3.05 root ssd -7 0.04999 host ceph-01-ssd 11 0.04999 osd.11 up 1 -8 1 host ceph-02-ssd 12 0.04999 osd.12 up 1 -9 1 host ceph-03-ssd 13 0.03999 osd.13 up 1 -10 1 host ceph-04-ssd 14 0.03999 osd.14 up 1 As you can see, only host ceph-01-ssd has the same weight as its osd, the other three hosts have weight 1 which is different from their associated osd. If the weight of the host -should- be the sum of all osd weights on this host, then my question becomes: how do I make that so for the three hosts where this is currently not the case? Thanks, Erik. On 20-10-14 03:55, Lei Dong wrote: According to my understanding, the weight of a host is the sum of all osd weights on this host. So you just reweight any osd on this host, the weight of this host is reweighed. Thanks LeiDong On 10/20/14, 7:11 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Simple question: how do I reweight a host in crushmap? I can use ceph osd crush reweight to reweight an osd, but I would like to change the weight of a host instead. I tried exporting the crushmap, but I noticed that the weights of all hosts are commented out, like so: # weight 5.460 And they are not the same values as seen in ceph osd tree. So how do I keep everything as it currently is, but simply change one single weight of one single host? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
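For reference, the usual way to change a bucket (host) weight by hand is to decompile, edit and re-inject the crush map; the file names below are only examples:

# ceph osd getcrushmap -o /tmp/crushmap
# crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
    (edit /tmp/crushmap.txt: the host's own weight is on the "item ceph-02-ssd weight 1.000" line inside the root bucket)
# crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
# ceph osd setcrushmap -i /tmp/crushmap.new

Alternatively, "ceph osd crush reweight osd.12 0.04999" adjusts the osd item weight and updates the ancestor buckets; re-running it may resync the host weights, but that depends on how they got out of sync in the first place.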
Re: [ceph-users] real beginner question
Hi Ranju, Is it possible to set up ceph in an already virtualized environment? Yes, obviously; you can try out all the features of ceph in a virtualized environment. In fact it is the easiest and recommended way of playing with Ceph. The Ceph docs list the way to do this; it should take hardly any time to get it done. http://ceph.com/docs/master/start/quick-start-preflight/ Please reach out in case of any issues.

On Mon, Oct 20, 2014 at 3:53 PM, Ranju Upadhyay ranju.upadh...@nuim.ie wrote: Hi list, This is a real newbie question (and hopefully the right list to ask!). Is it possible to set up ceph in an already virtualized environment? i.e. we have a scenario here where we have a virtual machine (as opposed to individual physical machines) with an Ubuntu OS on it. We are trying to create a ceph cluster on this virtual machine (not sure if this is a sensible thing to do!). In our effort to install ceph we used vagrant (we came across some notes through google). We thought that would be the easiest route, as we do not know anything yet. But we are unsuccessful. We can go as far as creating a virtual machine, but it fails at the provisioning stage (i.e. mons, osds, mdss, rgws etc. do not get created). Any suggestions? Thanks Ranju Upadhyay Maynooth University, Ireland.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Thanks and Regards Ashish Chandra Openstack Developer, Cloud Engineering Reliance Jio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
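If vagrant keeps failing at the provisioning stage, it can be simpler to follow the preflight/quick-start pages by hand with ceph-deploy. A minimal sketch, where the node names and the data disk (/dev/vdb) are made-up examples:

# ceph-deploy new node1                                 (node1 will run the first monitor)
# ceph-deploy install node1 node2 node3
# ceph-deploy mon create-initial
# ceph-deploy osd prepare node2:/dev/vdb node3:/dev/vdb
# ceph-deploy osd activate node2:/dev/vdb1 node3:/dev/vdb1
# ceph -s                                               (should eventually report HEALTH_OK)

That takes vagrant out of the picture, so any failure shows up directly in the ceph-deploy output instead.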
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
Test result update:

Number of Hosts   Max single volume IOPS   Max aggregated IOPS   SSD Disk IOPS   SSD Disk Utilization
7                 14k                      45k                   9800+           90%
8                 21k                      50k                   9800+           90%
9                 30k                      56k                   9800+           90%
10                40k                      54k                   8200+           70%

Note: the disk average request size is about 20 sectors, not the same as the client side (4k).

I have two questions about the result:

1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times the client side. Is this normal behavior in Ceph, or caused by some wrong configuration in my setup? The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 10000 * 22 * 512 * 2 * 9 ≈ 1980M/s. The client side is 56k * 4 = 244M/s.

Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    0.33    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00
sdb      0.00    6.00    0.00  10219.67  0.00    223561.67  21.88     4.08      0.40   0.09   89.43
sdc      0.00    6.00    0.00  9750.67   0.00    220286.67  22.59     2.47      0.25   0.09   89.83
dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
dm-1     0.00    0.00    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00

Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00
sdb      0.00    6.33    0.00  10389.00  0.00    224668.67  21.63     3.78      0.36   0.09   89.23
sdc      0.00    4.33    0.00  10106.67  0.00    217986.00  21.57     3.83      0.38   0.09   91.10
dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
dm-1     0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00

2. For the scalability issue (10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it?

Thanks!

2014-10-17 16:52 GMT+08:00 Mark Wu wud...@gmail.com:
I assume you added more clients and checked that it didn't scale past that?
Yes, correct.
You might look through the list archives; there are a number of discussions about how and how far you can scale SSD-backed cluster performance.
I have looked at those discussions before, in particular the one initiated by Sebastien: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12486.html From that thread I found that Giant can provide better utilization of the SSD backend. It does improve a lot in the 4k random write test, compared with Firefly. In the previous tests with Firefly and 16 osds, I found that the iops of 4k random write on a single volume is 14k, which almost reaches the peak of the whole cluster, while the iops on the SSD disk is less than 1000, which is far away from the hardware limitation. It looks like ceph doesn't dispatch fast enough. With 0.86, the following options and disabling debugging can improve things noticeably:
throttler perf counter = false
osd enable op tracker = false
Just scanning through the config options you set, you might want to bump up all the filestore and journal queue values a lot farther.
Tried the following options. It doesn't change.
ournal_queue_max_ops=3000
objecter_inflight_ops=10240
journal_max_write_bytes=1048576000
journal_queue_max_bytes=1048576000
ms_dispatch_throttle_bytes=1048576000
objecter_infilght_op_bytes=1048576000
filestore_max_sync_interval=10
I have a question about the relationship between the write I/O numbers performed on the ceph client and the osd disks. From the iostat pasted in the first message, the writes per second are about 5000 and the average request size is 17~22 sectors.
Roughly, the write throughput on all osd nodes is 20 * 512 * 5000 * 30 = 1500MB/s. The replica setting is 2 and the journal and osd data are on the same disk, so can we assume the write on the ssd disks is 40k (fio client result) * 4k * 2 * 2 = 640MB/s in theory? I don't understand why the actual write is so high compared with the theoretical value. And the average request size is also more than twice the client request size. I run blktrace to check if
[ceph-users] recovery process stops
Dear All, I have at the moment an issue with my cluster: the recovery process stops.

ceph -s
  health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%)
  monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6
  osdmap e6748: 24 osds: 23 up, 23 in
  pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%)

I have tried to restart all OSDs in the cluster, but that does not help to finish the recovery of the cluster. Does anyone have an idea? Kind Regards Harald Rößler
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to calculate file size when mount a block device from rbd image
Hi Mika, 2014-10-20 11:16 GMT+02:00 Vickie CH mika.leaf...@gmail.com: 2. Use the dd command to create a 1.2T file: #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000 I think you're off by one zero: 12288000 / 1024 / 1024 ≈ 11.7, which means you're instructing it to create an ~11.7TB file on a 1.5T volume. Cheers Benedikt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
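For reference, a count that actually produces roughly a 1.2T test file with bs=1M would be around 1228800 (1228800 MiB = 1200 GiB); the path below just reuses the one from the original post:

# dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=1228800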
Re: [ceph-users] recovery process stops
On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
I think it's because you have OSDs that are too full, as in the warning message. I had a similar problem recently and I did: ceph osd reweight-by-utilization But first read what this command does. It solved the problem for me.

2014-10-20 14:45 GMT+02:00 Harald Rößler harald.roess...@btd.de: Dear All I have in them moment a issue with my cluster. The recovery process stops. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0= 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
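A sketch of the two usual knobs (the threshold and the osd id below are only examples, not taken from this cluster):

# ceph osd reweight-by-utilization 110     (only adjusts OSDs more than 10% above the average utilization)
# ceph osd reweight 13 0.85                (manually lower the override weight of one over-full OSD; 1 = full weight)

Note that this changes the temporary "reweight" column shown in ceph osd tree, not the crush weight, so it is easy to set back to 1 once the cluster has rebalanced.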
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
On 10/20/2014 06:27 AM, Mark Wu wrote: Test result Update: Number of Hosts Maximum single volume IOPS Maximum aggregated IOPS SSD Disk IOPS SSD Disk Utilization 7 14k 45k 9800+90% 8 21k 50k 9800+90% 9 30k 56k 9800+ 90% 1040k 54k 8200+70% Note: the disk average request size is about 20 sectors, not same as client side (4k) I have two questions about the result: 1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times of client side. Is it normal behavior in Ceph, or caused by some wrong configuration in my setup? Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead. The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 1000 * 22 * 512 * 2 * 9 = 1980M/s The client side is 56k * 4 = 244M/s Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.330.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 sdb 0.00 6.000.00 10219.67 0.00 223561.67 21.88 4.080.40 0.09 89.43 sdc 0.00 6.000.00 9750.67 0.00 220286.67 22.59 2.470.25 0.09 89.83 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 sdb 0.00 6.330.00 10389.00 0.00 224668.67 21.63 3.780.36 0.09 89.23 sdc 0.00 4.330.00 10106.67 0.00 217986.00 21.57 3.830.38 0.09 91.10 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 2. For the scalability issue ( 10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it? Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug. Thanks! Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] why the erasure code pool not support random write?
hi, cephers: When I looked into the ceph source code, I found that the erasure code pool does not support random writes, only append writes. Why? Is it because random writes are high cost for erasure coding and the performance of deep scrub would be very poor? Thanks. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] real beginner question
Hi, Ranju. Are you talking about setting up Ceph Monitors and OSD nodes on VMs for the purposes of learning, or adding a Ceph storage cluster to an existing KVM-based infrastructure that's using local storage/NFS/iSCSI for block storage now? - If the former, this is pretty easy. Although performance will suffer, OSDs and Monitors will run fine in VMs, just observe the minimum specs in the official hardware howtos. I setup my first cluster like this: Ubuntu 14.04 Workstation (with LVM) -Ceph1: 14.04 with Mon1 and OSD using Raw disk access from different LVM partitions of hypervisor OS -Ceph2: -Ceph3: -Test VM1: 14.04 desktop with 20G filesystem exposed through RBD to libvirt. What's neat (and was non-obvious) was that simply configuring the KVM hypervisor as a Ceph client allowed you to leverage its exposed storage even though the hosts exposing that storage were VMs on the same machine (horribly non-resilient design, yes, but it helped teach the concepts). - If you're looking to do the latter, you can create your Ceph cluster of nodes adjacent your existing infrastructure, configure your hypervisor nodes as ceph/rbd clients (and test them with ceph -w, etc) then convert/copy the disk images one by one to rbd block images: http://ceph.com/docs/master/rbd/libvirt/ http://ceph.com/docs/master/rbd/qemu-rbd/ Once you create a few test VMs on local disk and get into the practice of migrating them over, you'll find it's pretty straightforward with the commands listed in those pages. Dan Dan Geist dan(@)polter.net - Original Message - From: Ranju Upadhyay ranju.upadh...@nuim.ie To: ceph-users@lists.ceph.com Sent: Monday, October 20, 2014 6:23:59 AM Subject: [ceph-users] real beginner question Hi list, This is a real newbie question.(and hopefully the right list to ask to!) Is it possible to set up ceph in an already virtualized environment? i.e. we have a scenario here, where we have virtual machine ( as opposed to individual physical machines) with ubuntu OS on it. We are trying to create ceph cluster on this virtual machine . (not sure if this is a sensible thing to do!) On our effort to install ceph we used vagrant ( came across some notes through google). We thought that would be the easiest route, as we do not know anything yet. But we are unsuccessful. We can go as far as creating a virtual machine but it fails as provisioning stage (i.e. mons;osds;mdss;rgws etc do not get created) Any suggestions? Thanks Ranju Upadhyay Maynooth University, Ireland. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
This is a common constraint in many erasure coding storage system. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote: hi, cephers: When I look into the ceph source code, I found the erasure code pool not support the random write, it only support the append write. Why? Is that random write of is erasure code high cost and the performance of the deep scrub is very poor? Thanks. -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
2014-10-20 21:04 GMT+08:00 Mark Nelson mark.nel...@inktank.com: On 10/20/2014 06:27 AM, Mark Wu wrote: Test result Update: Number of Hosts Maximum single volume IOPS Maximum aggregated IOPS SSD Disk IOPS SSD Disk Utilization 7 14k 45k 9800+ 90% 8 21k 50k 9800+90% 9 30k 56k 9800+ 90% 1040k 54k 8200+70% Note: the disk average request size is about 20 sectors, not same as client side (4k) I have two questions about the result: 1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times of client side. Is it normal behavior in Ceph, or caused by some wrong configuration in my setup? Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead. Yes, the journal writes and replication are counted into backend writes. Each ssd disk has two partitions: the raw one is used for journal and the one formatted as xfs is used osd data. The replica setting is 2. So considering the journal writes and replication, I expect the writes on backend is 4 times of client side. From the perspective of disk utilization, it's good because it's already close to the physical limitation. But the overhead is too big. Is it possible to try your idea without modifying code? If yes, I am glad to give it a try. The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 1000 * 22 * 512 * 2 * 9 = 1980M/s The client side is 56k * 4 = 244M/s Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.330.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 sdb 0.00 6.000.00 10219.67 0.00 223561.67 21.88 4.080.40 0.09 89.43 sdc 0.00 6.000.00 9750.67 0.00 220286.67 22.59 2.470.25 0.09 89.83 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 sdb 0.00 6.330.00 10389.00 0.00 224668.67 21.63 3.780.36 0.09 89.23 sdc 0.00 4.330.00 10106.67 0.00 217986.00 21.57 3.830.38 0.09 91.10 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 2. For the scalability issue ( 10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it? Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug. In the test, we run vdbench with the following parameters on one host: sd=sd1,lun=/dev/rbd2,threads=128 sd=sd2,lun=/dev/rbd0,threads=128 sd=sd3,lun=/dev/rbd1,threads=128 *sd=sd4,lun=/dev/rbd3,threads=128 wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct wd=wd2,sd=sd2,xfersize=4k,rdpct=0,openflags=o_direct wd=wd3,sd=sd3,xfersize=4k,rdpct=0,openflags=o_direct *wd=wd4,sd=sd4,xfersize=4k,rdpct=0,openflags=o_direct rd=run1,wd=wd*,iorate=10,elapsed=500,interval=1 Thanks! Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
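One way to cross-check where the amplification comes from is to read the OSD perf counters directly and compare client write bytes, replica (subop) write bytes and journal bytes (a sketch; counter names can differ slightly between releases, and the socket path assumes the default location):

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | grep -E 'op_w_in_bytes|subop_w_in_bytes|journal_wr'

Summing those across all OSDs and comparing against the client-side MB/s from vdbench should show whether the extra factor of ~2 beyond journal+replication is really hitting the disks, or is an artifact of how iostat accounts for the co-located journal partition.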
Re: [ceph-users] recovery process stops
On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
On 10/20/2014 05:10 PM, Harald Rößler wrote: yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? If the disks are all full, then, no. Sorry to say this, but it came down to poor capacity management. Never let any disk in your cluster fill over 80% to prevent these situations. Wido Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
On 10/20/2014 03:25 PM, 池信泽 wrote: hi, cephers: When I look into the ceph source code, I found the erasure code pool not support the random write, it only support the append write. Why? Is that random write of is erasure code high cost and the performance of the deep scrub is very poor? To modify a EC object you need to read all chunks in order to compute the parity again. So that would involve a lot of reads for what might be just a very small write. That's also why EC can't be used for RBD images. Thanks. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph OSD very slow startup
Hi, More information on our Btrfs tests.

On 14/10/2014 19:53, Lionel Bouton wrote: Current plan: wait at least a week to study 3.17.0 behavior and upgrade the 3.12.21 nodes to 3.17.0 if all goes well.

3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no corruption but the OSD goes down) on some access patterns with snapshots: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html The bug may be present in earlier kernels (at least the 3.16.4 code in fs/btrfs/qgroup.c doesn't handle the case differently than 3.17.0 and 3.17.1) but seems at least less likely to show up (never saw it with 3.16.4 in several weeks but it happened with 3.17.1 three times in just a few hours). As far as I can tell from its Changelog, 3.17.1 didn't patch any vfs/btrfs path vs 3.17.0, so I assume 3.17.0 has the same behaviour. I switched all servers to 3.16.4, which I had previously tested without any problem.

The performance problem is still there with 3.16.4. In fact one of the 2 large OSDs was so slow it was repeatedly marked out and generated lots of latencies when in. I just had to remove it: when this OSD is shut down with noout to avoid backfills slowing down the storage network, latencies are back to normal. I chose to reformat this one with XFS. The other big node has a nearly perfectly identical system (same hardware, same software configuration, same logical volume configuration, same weight in the crush map, comparable disk usage in the OSD fs, ...) but is behaving itself (maybe slower than our smaller XFS and Btrfs OSDs, but usable). The only notable difference is that it was formatted more recently. So the performance problem might be linked to the cumulative amount of data access to the OSD over time. If my suspicion is true I believe we might see performance problems on the other Btrfs OSDs later (we'll have to wait).

Is any Btrfs developer subscribed to this list? I could forward this information to linux-btrfs@vger if needed, but I can't offer much debugging help (the storage cluster is in production and I'm more inclined to migrate slow OSDs to XFS than to do invasive debugging with Btrfs). Best regards, Lionel Bouton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
Yes I agree 100%, but actual every disk have a maximum of 86% usage, there should a way to recover the cluster. To set the near full ratio to higher than 85% should be only a short term solution. New disk for higher capacity are already ordered, I only don’t like degraded situation, for a week or more. Also one of the VM’s doesn’t start because an slow request warning. Thanks for your advise. Harald Rößler Am 20.10.2014 um 17:12 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 05:10 PM, Harald Rößler wrote: yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? If the disks are all full, then, no. Sorry to say this, but it came down to poor capacity management. Never let any disk in your cluster fill over 80% to prevent these situations. Wido Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. 
Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
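If adding capacity has to wait for the new disks, the setting that actually gates the backfill_toofull state is the OSDs' backfill full ratio (85% by default in releases of this era), which can be raised temporarily at runtime. A sketch, to be treated with the same caution as raising the near-full ratio:

# ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'

The injected value does not survive an OSD restart unless it is also put in ceph.conf, and it should be set back once the over-full OSDs have drained.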
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
On 10/20/2014 09:28 AM, Mark Wu wrote: 2014-10-20 21:04 GMT+08:00 Mark Nelson mark.nel...@inktank.com mailto:mark.nel...@inktank.com: On 10/20/2014 06:27 AM, Mark Wu wrote: Test result Update: Number of Hosts Maximum single volume IOPS Maximum aggregated IOPS SSD Disk IOPS SSD Disk Utilization 7 14k 45k 9800+ 90% 8 21k 50k 9800+ 90% 9 30k 56k 9800+ 90% 1040k 54k 8200+ 70% Note: the disk average request size is about 20 sectors, not same as client side (4k) I have two questions about the result: 1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times of client side. Is it normal behavior in Ceph, or caused by some wrong configuration in my setup? Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead. Yes, the journal writes and replication are counted into backend writes. Each ssd disk has two partitions: the raw one is used for journal and the one formatted as xfs is used osd data. The replica setting is 2. So considering the journal writes and replication, I expect the writes on backend is 4 times of client side. From the perspective of disk utilization, it's good because it's already close to the physical limitation. But the overhead is too big. Is it possible to try your idea without modifying code? If yes, I am glad to give it a try. Sadly it will require code changes and is something we've only briefly talked about. So it is surprising that you would see 8x writes with 2x replication and on-disk journals imho. In the past one of the things I've done is add up all of the totals for the entire test both on the client side and on the server side just to make sure that the numbers are right. At least in past testing things properly added up, at least on our test rig. The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 1000 * 22 * 512 * 2 * 9 = 1980M/s The client side is 56k * 4 = 244M/s Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.330.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 sdb 0.00 6.000.00 10219.67 0.00 223561.67 21.88 4.080.40 0.09 89.43 sdc 0.00 6.000.00 9750.67 0.00 220286.67 22.59 2.470.25 0.09 89.83 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 sdb 0.00 6.330.00 10389.00 0.00 224668.67 21.63 3.780.36 0.09 89.23 sdc 0.00 4.330.00 10106.67 0.00 217986.00 21.57 3.830.38 0.09 91.10 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 2. For the scalability issue ( 10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it? Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug. 
In the test, we run vdbench with the following parameters on one host: sd=sd1,lun=/dev/rbd2,threads=128 sd=sd2,lun=/dev/rbd0,threads=128 sd=sd3,lun=/dev/rbd1,threads=128 *sd=sd4,lun=/dev/rbd3,threads=128 wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct
[ceph-users] RADOS pool snaps and RBD
Hi, It seems Ceph doesn't allow rados pool snapshots on RBD pools which have or had RBD snapshots. They only work on RBD pools which never had an RBD snapshot.

So, basically this works:
rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
ceph osd pool mksnap test-pool rados-snap

But this doesn't:
rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
rbd -p test-pool snap create test-image@rbd-snap
ceph osd pool mksnap test-pool rados-snap

And we get the following error message: Error EINVAL: pool test-pool is in unmanaged snaps mode

I've been checking the source code and it seems to be the expected behavior, but I did not manage to find any information regarding unmanaged snaps mode. Also I did not find any information about RBD snapshots and pool snapshots being mutually exclusive. And even deleting all the RBD snapshots in a pool doesn't enable RADOS snapshots again. So, I have a couple of questions:
- Are RBD and RADOS snapshots mutually exclusive?
- What does the unmanaged snaps mode message mean?
- Is there any way to revert a pool's status to allow RADOS pool snapshots after all RBD snapshots are removed?

We are designing a quite interesting way to perform incremental backups of RBD pools managed by OpenStack Cinder. The idea is to do the incremental backup at the RADOS level, basically using the mtime property of each object and comparing it against the time we did the last backup / pool snapshot. That way it should be really easy to find modified objects and transfer only them, making the implementation of a DR solution easier. But the issue explained here would be a big problem, as the backup solution would stop working if just one user creates an RBD snapshot on the pool (for example using Cinder Backup). I hope somebody could give us more information about this unmanaged snaps mode or point us to a way to revert this behavior once all RBD snapshots have been removed from a pool. Thanks!

Best regards, Xavier Trilla P. Silicon Hosting Did you know that SiliconHosting now answers your technical questions for free? More information at: siliconhosting.com/qa/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
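For the mtime-based incremental idea, the per-object modification time is already exposed by the rados CLI, so a first rough pass could look like this (a sketch only; the pool name "volumes" is an assumption, the stat output format varies between releases, and issuing one stat per object will be slow on large pools compared to doing the same via librados):

# rados -p volumes ls > /tmp/objects
# while read obj; do
#     rados -p volumes stat "$obj"      # prints the object's size and mtime
# done < /tmp/objects

Objects whose mtime is newer than the previous run's timestamp would then be fetched with rados get (or read via librados) and shipped to the backup site.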
Re: [ceph-users] why the erasure code pool not support random write?
On 20/10/2014 16:39, Wido den Hollander wrote:
On 10/20/2014 03:25 PM, 池信泽 wrote:

hi, cephers: When I look into the Ceph source code, I found that erasure coded pools do not support random writes, only append writes. Why? Is it because random writes are high cost for erasure code, and the performance of deep scrub would be very poor?

To modify an EC object you need to read all chunks in order to compute the parity again. So that would involve a lot of reads for what might be just a very small write. That's also why EC can't be used for RBD images.

I'm surprised this is a show stopper. Even if writes are really slow, I can see several use cases for RBD images on EC pools (archiving, template RBDs, ...). Using tier caching in a write-back configuration might even alleviate some of the performance problems, if writes from the cache pool are done on properly aligned and sized chunks of data.

It may be overly optimistic (the small benchmark on the following page might have been done with all planets aligned...) but Sheepdog seems to implement EC storage with performance that would make theoretical Ceph EC RBDs interesting for me, if I could get equivalent performance on purely sequential accesses.

https://github.com/sheepdog/sheepdog/wiki/Erasure-Code-Support#performance

Lionel Bouton
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
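To make the read-modify-write cost concrete, here is a toy sketch with simple XOR parity (k=2 data chunks plus one parity chunk; Ceph's jerasure plugins are more general, but the update pattern for a small overwrite is the same):

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # One stripe: k=2 data chunks plus a single XOR parity chunk.
    d0 = bytearray(b'AAAA')
    d1 = bytearray(b'BBBB')
    parity = xor(d0, d1)

    # A small in-place overwrite of d1 forces a read-modify-write:
    old = bytes(d1)                 # 1. read the old data chunk (or old parity)
    d1[2:3] = b'X'                  # 2. apply the 1-byte overwrite
    parity = xor(parity, xor(old, bytes(d1)))   # 3. recompute parity from the delta
    # 4. write back both the modified data chunk and the new parity chunk

    assert parity == xor(d0, d1)    # parity is consistent with the new data

Appending a whole new stripe only needs the new chunks plus a fresh parity write; the in-place overwrite above needs the old chunk (or old parity) read back before the new parity can be written, which is the extra read traffic Wido describes above.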
Re: [ceph-users] recovery process stops
You can set lower weight on full osds, or try changing the osd_near_full_ratio parameter in your cluster from 85 to for example 89. But i don't know what can go wrong when you do that. 2014-10-20 17:12 GMT+02:00 Wido den Hollander w...@42on.com: On 10/20/2014 05:10 PM, Harald Rößler wrote: yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? If the disks are all full, then, no. Sorry to say this, but it came down to poor capacity management. Never let any disk in your cluster fill over 80% to prevent these situations. Wido Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0= 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. 
Phone: +31 (0)20 700 9902 Skype: contact42on -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph OSD very slow startup
On Mon, Oct 20, 2014 at 8:25 AM, Lionel Bouton lionel+c...@bouton.name wrote: Hi, More information on our Btrfs tests. Le 14/10/2014 19:53, Lionel Bouton a écrit : Current plan: wait at least a week to study 3.17.0 behavior and upgrade the 3.12.21 nodes to 3.17.0 if all goes well. 3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no corruption but OSD goes down) on some access patterns with snapshots: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html The bug may be present in earlier kernels (at least the 3.16.4 code in fs/btrfs/qgroup.c doesn't handle the case differently than 3.17.0 and 3.17.1) but seems at least less likely to show up (never saw it with 3.16.4 in several weeks but it happened with 3.17.1 three times in just a few hours). As far as I can tell from its Changelog, 3.17.1 didn't patch any vfs/btrfs path vs 3.17.0 so I assume 3.17.0 has the same behaviour. I switched all servers to 3.16.4 which I had previously tested without any problem. The performance problem is still there with 3.16.4. In fact one of the 2 large OSD was so slow it was repeatedly marked out and generated lots of latencies when in. I just had to remove it: when this OSD is shut down with noout to avoid backfills slowing down the storage network, latencies are back to normal. I chose to reformat this one with XFS. The other big node has a nearly perfectly identical system (same hardware, same software configuration, same logical volume configuration, same weight in the crush map, comparable disk usage in the OSD fs, ...) but is behaving itself (maybe slower than our smaller XFS and Btrfs OSD, but usable). The only notable difference is that it was formatted more recently. So the performance problem might be linked to the cumulative amount of data access to the OSD over time. Yeah; we've seen this before and it appears to be related to our aggressive use of btrfs snapshots; it seems that btrfs doesn't defrag well under our use case. The btrfs developers make sporadic concerted efforts to improve things (and succeed!), but it apparently still hasn't gotten enough better yet. :( -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph RBD
On Mon, 20 Oct 2014, Dianis Dimoglo wrote: I installed ceph two nodes, 2 mon 2 osd in xfs, also used the RBD and mount the pool on two different ceph host and when I write data through one of the hosts at the other I do not see the data, what's wrong? Although the RBD disk can be shared, that will only be useful if the file system you put on top is designed to allow that. The usual suspects (ext4, xfs, etc.) do not--they assume only a single host is using the disk at any time. That means that unless you deploy a cluster fs like ocfs2 or gfs2, you can only use an RBD on a single host at a time. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph RBD
Sage, Even with cluster file system, it will still need a fencing mechanism to allow SCSI device shared by multiple host, what kind of SCSI reservation RBD currently support? Fred Sent from my Samsung Galaxy S3 On Oct 20, 2014 4:42 PM, Sage Weil s...@newdream.net wrote: On Mon, 20 Oct 2014, Dianis Dimoglo wrote: I installed ceph two nodes, 2 mon 2 osd in xfs, also used the RBD and mount the pool on two different ceph host and when I write data through one of the hosts at the other I do not see the data, what's wrong? Although the RBD disk can be shared, that will only be useful if the file system you put on top is designed to allow that. The usual suspects (ext4, xfs, etc.) do not--they assume only a single host is using the disk at any time. That means that unless you deploy a cluster fs like ocfs2 or gfs2, you can only use an RBD on a single host at a time. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
Thanks. Another reason is that the checksum stored in the object's attr, which is used for deep scrub in EC pools, has to be recomputed when the object is modified. If random writes were supported, we would have to recalculate the checksum over the whole object even if only one bit were modified. If only append writes are supported, we can derive the new checksum from the previous checksum and the appended data, which is much quicker. Am I right?

2014-10-21 0:36 GMT+08:00 Gregory Farnum g...@inktank.com:

This is a common constraint in many erasure coding storage systems. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg

On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote:

hi, cephers: When I look into the Ceph source code, I found that erasure coded pools do not support random writes, only append writes. Why? Is it because random writes are high cost for erasure code, and the performance of deep scrub would be very poor? Thanks.

-- Software Engineer #42 @ http://inktank.com | http://ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
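The append-only case maps nicely onto a running digest: the existing hash state can simply be fed the appended bytes, whereas an overwrite in the middle forces re-reading and re-hashing the whole object. A small illustration (sha1 is used here purely as an example; it is not what Ceph's scrub code actually uses):

    import hashlib

    obj = bytearray(b'0123456789')
    h = hashlib.sha1(obj)                # running checksum of the current object

    # Append-only write: update the running digest with just the new bytes.
    appended = b'ABCDEF'
    obj += appended
    h.update(appended)                   # cost proportional to the appended data
    assert h.hexdigest() == hashlib.sha1(obj).hexdigest()

    # Random overwrite: the digest cannot be patched incrementally,
    # so the whole object has to be re-read and re-hashed.
    obj[3:5] = b'XY'
    h = hashlib.sha1(obj)                # cost proportional to the full object size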
Re: [ceph-users] why the erasure code pool not support random write?
Hi,

On 21/10/2014 01:10, 池信泽 wrote:
Thanks. Another reason is that the checksum stored in the object's attr, which is used for deep scrub in EC pools, has to be recomputed when the object is modified. If random writes were supported, we would have to recalculate the checksum over the whole object even if only one bit were modified. If only append writes are supported, we can derive the new checksum from the previous checksum and the appended data, which is much quicker. Am I right?

From what I understand, the deep scrub doesn't use a Ceph checksum but compares data between OSDs (and probably uses a "majority wins" rule for repair). If you are using Btrfs it will report an I/O error, because it uses an internal checksum by default, which will force Ceph to use other OSDs for repair. I'd be glad to be proven wrong on this subject though.

Best regards,

Lionel Bouton
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD (and probably other settings) not being picked up outside of the [global] section
I'm still running Emperor, but I'm not seeing that behavior. My ceph.conf is pretty similar: [global] mon initial members = ceph0 mon host = 10.129.0.6:6789, 10.129.0.7:6789, 10.129.0.8:6789 cluster network = 10.130.0.0/16 osd pool default flag hashpspool = true osd pool default min size = 2 osd pool default size = 3 public network = 10.129.0.0/16 [osd] osd journal size = 6144 osd mkfs options xfs = -s size=4096 osd mkfs type = xfs osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64 If you manually run ceph-disk-prepare and ceph-disk-activate, are the mkfs params being picked up? For the daemon configs, you can query a running daemon to see what it's config params are: root@ceph0:~# ceph daemon osd.0 config get 'osd_op_threads' { osd_op_threads: 2} root@ceph0:~# ceph daemon osd.0 config get 'osd_scrub_load_threshold' { osd_scrub_load_threshold: 0.5} While we try to figure this out, you can tell the running daemons to use your values with: ceph tell osd.\* --inject_args '--osd_op_threads 10' On Thu, Oct 16, 2014 at 6:54 PM, Christian Balzer ch...@gol.com wrote: Hello, Consider this rather basic configuration file: --- [global] fsid = e6687ef7-54e1-44bd-8072-f9ecab00815 mon_initial_members = ceph-01, comp-01, comp-02 mon_host = 10.0.0.21,10.0.0.5,10.0.0.6 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true mon_osd_downout_subtree_limit = host public_network = 10.0.0.0/8 osd_pool_default_pg_num = 2048 osd_pool_default_pgp_num = 2048 osd_crush_chooseleaf_type = 1 [osd] osd_mkfs_type = ext4 osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0 osd_op_threads = 10 osd_scrub_load_threshold = 2.5 filestore_max_sync_interval = 10 --- Let us slide the annoying fact that ceph ignores the pg and pgp settings when creating the initial pools. And that monitors are preferred based on IP address instead of the sequence they're listed in the config file. Interestingly ceph-deploy correctly picks up the mkfs_options but why it fails to choose the mkfs_type as default is beyond me. The real issue is that the other three OSD setting are NOT picked up by ceph on startup. But they sure are when moved to the global section. Anybody else seeing this (both with 0.80.1 and 0.80.6)? Regards, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] urgent- object unfound
It's probably a bit late now, but did you get the issue resolved? If not, why is OSD.49 down? I'd start by trying to get all of your OSDs back UP and IN. It may take a little while to unblock the requests. Recovery doesn't appear to prioritize blocked PGs, so it might take a while for recovery to get to the PG you care about. PGs tracks which OSDs were the primary over time (see ceph pg 6.766 query) . If your OSDs are flapping, it's possible for OSD.49 to have the most recent version of object1, and OSD.21 to have the most recent version of object2. Ceph can repair this, but it needs all of the PGs with the most recent version to be UP, and it blocks until the current primary has the latest version of the object requested. On Thu, Oct 16, 2014 at 5:36 AM, Ta Ba Tuan tuant...@vccorp.vn wrote: Hi eveyone, I use replicate 3, many unfound object and Ceph very slow. pg 6.9d8 is active+recovery_wait+degraded+remapped, acting [22,93], 4 unfound pg 6.766 is active+recovery_wait+degraded+remapped, acting [21,36], 1 unfound pg 6.73f is active+recovery_wait+degraded+remapped, acting [19,84], 2 unfound pg 6.63c is active+recovery_wait+degraded+remapped, acting [10,37], 2 unfound pg 6.56c is active+recovery_wait+degraded+remapped, acting [124,93], 2 unfound pg 6.4d3 is active+recovering+degraded+remapped, acting [33,94], 2 unfound pg 6.4a5 is active+recovery_wait+degraded+remapped, acting [11,94], 2 unfound pg 6.2f9 is active+recovery_wait+degraded+remapped, acting [22,34], 2 unfound recovery 535673/52672768 objects degraded (1.017%); 17/17470639 unfound (0.000%) ceph pg map 6.766 osdmap e94990 pg 6.766 (6.766) - up [49,36,21] acting [21,36] I can't resolve it. I need data on those objects. Guide me, please! Thank you! -- Tuan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
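For anyone scripting this, the "which OSDs might still hold the unfound copies" check can be pulled out of the pg query output; a rough sketch (the recovery_state/might_have_unfound layout is what Firefly prints, but field names can differ between releases, so treat this as an assumption to verify):

    import json
    import subprocess

    def might_have_unfound(pgid):
        out = subprocess.check_output(['ceph', 'pg', pgid, 'query'])
        info = json.loads(out.decode())
        hits = []
        # recovery_state is a list of state dicts; the active one usually
        # carries a might_have_unfound list of {osd, status} entries.
        for state in info.get('recovery_state', []):
            for entry in state.get('might_have_unfound', []):
                hits.append((entry.get('osd'), entry.get('status')))
        return hits

    for pgid in ['6.766', '6.9d8', '6.73f']:
        print(pgid, might_have_unfound(pgid))

The output should line up with what "ceph pg <pgid> query" prints by hand; the script just makes it easy to loop over all the PGs reporting unfound objects.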
Re: [ceph-users] Ceph RBD
Hi Fred, There is a fencing mechanism. There is work underway to wire it up to an iSCSI target (LIO in this case), but I think that isn't needed to simply run ocfs2 (or similar) directly on top of an RBD device. Honestly I'm not quite sure how that would glue together. sage On Mon, 20 Oct 2014, Fred Yang wrote: Sage, Even with cluster file system, it will still need a fencing mechanism to allow SCSI device shared by multiple host, what kind of SCSI reservation RBD currently support? Fred Sent from my Samsung Galaxy S3 On Oct 20, 2014 4:42 PM, Sage Weil s...@newdream.net wrote: On Mon, 20 Oct 2014, Dianis Dimoglo wrote: I installed ceph two nodes, 2 mon 2 osd in xfs, also used the RBD and mount the pool on two different ceph host and when I write data through one of the hosts at the other I do not see the data, what's wrong? Although the RBD disk can be shared, that will only be useful if the file system you put on top is designed to allow that. The usual suspects (ext4, xfs, etc.) do not--they assume only a single host is using the disk at any time. That means that unless you deploy a cluster fs like ocfs2 or gfs2, you can only use an RBD on a single host at a time. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Use case: one-way RADOS replication between two clusters by time period
RadosGW Federation can fulfill this use case: http://ceph.com/docs/master/radosgw/federated-config/ . Depending on your setup, it may or may not be easy.

To start, radosgw-agent handles the replication. It does the metadata (users and buckets) and the data (objects in a bucket). It only flows from the primary to the secondary, so you're good there. It tracks what's been replicated, and maintains this state in (I believe) the secondary cluster. If replication is started up after being down, it starts from the last replication timestamp and runs up to now (whatever "now" is when the run starts). Objects that have been deleted and garbage collected in the primary won't replicate, but that won't cause the replication to fail.

The current version of radosgw-agent, 1.2, attempts to get everything from its last replication timestamp to current in a single pass. It doesn't persist its replication state until it finishes that pass. Because of this, any interruption of the replication will start over. This is really only a problem if you have large buckets. If you have many buckets with a small amount of data, you'll just want to run a lot of replication threads. I have a few buckets, with ~1M objects and ~1 TiB of data per bucket. Took me a while to figure out that nightly log rotation was restarting the daemon. Once I disabled log rotation, I ran into problems with the stability of my VPN connection.

It's definitely doable. I would set up some virtual test clusters and try it out.

On Thu, Oct 16, 2014 at 2:05 AM, Anthony Alba ascanio.al...@gmail.com wrote:

Hi list, Can RADOS fulfil the following use case: I wish to have a radosgw-S3 object store that is LIVE; this represents the current objects of users. Separated by an air gap is another radosgw-S3 object store that is ARCHIVE. The objects will only be created and manipulated by radosgw. Periodically (on the order of every 3-6 months), I want to connect the two clusters and replicate all objects from LIVE to ARCHIVE created in the time period DDMM1 - DDMM2, or better yet from the last timestamp. This is a one-way replication and the objects are transferred only in the LIVE -> ARCHIVE direction. Can this be done easily? Thanks Anthony

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD (and probably other settings) not being picked up outside of the [global] section
Hello, On Mon, 20 Oct 2014 17:09:57 -0700 Craig Lewis wrote: I'm still running Emperor, but I'm not seeing that behavior. My ceph.conf is pretty similar: Yeah, I tested things extensively with Emperor back in the day and at that time frequently verified that changes in the config file were reflected in the running configuration after a restart. Until last week I of course blissfully assumed that this basic functionality would still work in Firefly. ^o^ [global] mon initial members = ceph0 mon host = 10.129.0.6:6789, 10.129.0.7:6789, 10.129.0.8:6789 cluster network = 10.130.0.0/16 osd pool default flag hashpspool = true osd pool default min size = 2 osd pool default size = 3 public network = 10.129.0.0/16 [osd] osd journal size = 6144 osd mkfs options xfs = -s size=4096 osd mkfs type = xfs osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64 If you manually run ceph-disk-prepare and ceph-disk-activate, are the mkfs params being picked up? No idea really, I will have to test that. Of course with ceph-deploy (and I assume ceph-disk-prepare) the activate bit is a bit of misnomer, as the udev magic will happily activate an OSD instantly after creation despite me using just ceph-deploy osd prepare For the daemon configs, you can query a running daemon to see what it's config params are: root@ceph0:~# ceph daemon osd.0 config get 'osd_op_threads' { osd_op_threads: 2} root@ceph0:~# ceph daemon osd.0 config get 'osd_scrub_load_threshold' { osd_scrub_load_threshold: 0.5} I of course know that, that is how I found out that things didn't get picked up. While we try to figure this out, you can tell the running daemons to use your values with: ceph tell osd.\* --inject_args '--osd_op_threads 10' That I'm also aware of, but for the time being having everything in [global] resolves the problem and more importantly makes it reboot proof. Christian On Thu, Oct 16, 2014 at 6:54 PM, Christian Balzer ch...@gol.com wrote: Hello, Consider this rather basic configuration file: --- [global] fsid = e6687ef7-54e1-44bd-8072-f9ecab00815 mon_initial_members = ceph-01, comp-01, comp-02 mon_host = 10.0.0.21,10.0.0.5,10.0.0.6 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true mon_osd_downout_subtree_limit = host public_network = 10.0.0.0/8 osd_pool_default_pg_num = 2048 osd_pool_default_pgp_num = 2048 osd_crush_chooseleaf_type = 1 [osd] osd_mkfs_type = ext4 osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0 osd_op_threads = 10 osd_scrub_load_threshold = 2.5 filestore_max_sync_interval = 10 --- Let us slide the annoying fact that ceph ignores the pg and pgp settings when creating the initial pools. And that monitors are preferred based on IP address instead of the sequence they're listed in the config file. Interestingly ceph-deploy correctly picks up the mkfs_options but why it fails to choose the mkfs_type as default is beyond me. The real issue is that the other three OSD setting are NOT picked up by ceph on startup. But they sure are when moved to the global section. Anybody else seeing this (both with 0.80.1 and 0.80.6)? 
Regards, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] how to resolve : start mon assert == 0
thank you very much, Shu, Xinxin

I just started all mons, with the command "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" on each mon node.

Min Chen

On 2014-10-20, Shu, Xinxin xinxin@intel.com wrote:

-----Original Message-----
From: Shu, Xinxin xinxin@intel.com
Sent: Monday, October 20, 2014
To: minchen minc...@ubuntukylin.com, ceph-users ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Subject: RE: [ceph-users] how to resolve : start mon assert == 0

Please refer to http://tracker.ceph.com/issues/8851

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of minchen
Sent: Monday, October 20, 2014 3:42 PM
To: ceph-users; ceph-de...@vger.kernel.org
Subject: [ceph-users] how to resolve : start mon assert == 0

Hello, all. When I restart any mon in the mon cluster {mon.a, mon.b, mon.c} after killing all mons (cephx disabled), an exception occurs as follows:

# ceph-mon -i b
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fc801c78780 time 2014-10-20 15:29:31.966367
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7fc7ffc7876d]
8: ceph-mon() [0x537bf9]

Anyone can help to solve this problem?

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RADOS pool snaps and RBD
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Xavier Trilla Sent: Tuesday, October 21, 2014 12:42 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] RADOS pool snaps and RBD Hi, It seems Ceph doesn't allow rados pool snapshots on RBD pools which have or had RBD snapshots. They only work on RBD pools which never had a RBD snapshot. So, basically this works: rados mkpool test-pool 1024 1024 replicated rbd -p test-pool create --size=102400 test-image ceph osd pool mksnap test-pool rados-snap But this doesn't: rados mkpool test-pool 1024 1024 replicated rbd -p test-pool create --size=102400 test-image rbd -p test-pool snap create test-image@rbd-snap ceph osd pool mksnap test-pool rados-snap And we get the following error message: Error EINVAL: pool test-pool is in unmanaged snaps mode I've been checking the source code and it seems to be the expecte behavior, but I did not manage to find any information regarding unmanaged snaps mode. Also I did not find any information about RBD snapshots and pool snapshots being mutually exclusive. And even deleting all the RBD snapshots in a pool doesn't enable RADOS snapshots again. So, I have a couple of questions: - Are RBD and RADOS snapshots mutually exclusive? I think the answer is yes , this will be checked before you get your snap_seq in OSDMonitor. - What does mean unmanaged snaps mode message? This means you have create a rbd snapshot , you cannot create a snapshot for rados - Is there any way to revert a pool status to allow RADOS pool snapshots after all RBD snapshots are removed? not very sure We are designing a quite interesting way to perform incremental backups of RBD pools managed by OpenStack Cinder. The idea is to do the incremental backup at a RADOS level, basically using the mtime property of the object and comparing it against the time we did the last backup / pool snapshot. That way it should be really easy to find modified objects transferring only them, making the implementation of a DR solution easier.. But the issue explained here would be a big problem, as the backup solution would stop working if just one user creates a RBD snapshot on the pool (For example using Cinder Backup). I hope somebody could give us more information about this unmanaged snaps mode or point us to a way to revert this behavior once all RBD snapshots have been removed from a pool. Thanks! Saludos cordiales, Xavier Trilla P. Silicon Hosting ¿Sabías que ahora en SiliconHosting resolvemos tus dudas técnicas gratis? Más información en: siliconhosting.com/qa/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosGW balancer best practices
I'm using my existing HAProxy server to also balance my RadosGW nodes. I'm not going to run into bandwidth problems on that link any time soon, but I'll split RadosGW off onto it's own HAProxy instance when it does become congested. I have a smaller cluster, only 5 nodes. I'm running mon on the first 3 nodes, and osd and rgw on all 5 nodes. rgw and mon have very little overhead. I don't plan put those services on dedicated nodes anytime soon. MONs need a decent amount of Disk I/O; bad things happen if the disks can't keep up with the MONMAP updates. Virtual machines with dedicated IOPS should work fine for them. RadosGW doesn't need much CPU or Disk I/O. I would have no problem testing virtual machines as RadosGW nodes. As far as few-and-big or many-and-small, my gut feeling is that Apache FastCGI isn't going to scale up to 10 GigE speeds. Obviously you should test this. I don't see much downside to going many-and-small, unless you're planning to go crazy with the many. But that also depends on your network, and how the GigE and 10 GigE networks switch/route. If you have some spare CPU on your MON machines, I see no reason they can't double as RadosGW nodes too. On Tue, Oct 14, 2014 at 11:41 AM, Simone Spinelli simone.spine...@unipi.it wrote: Dear all, we are going to add rados-gw to our ceph cluster (144 OSD on 12 servers + 3 monitors connected via 10giga network) and we have a couple of questions. The first question is about the load balancer, do you have some advice based on real-world experience? Second question is about the number of gateway instances: is it better to have many littlegiga-connected servers or less fat10giga-connected servers considering that the total bandwidth available is 10 giga anyway? Do you use real or virtual servers? Any advice in terms of performances and reliability? Many thanks! Simone -- Simone Spinelli simone.spine...@unipi.it Università di Pisa Direzione ICT - Servizi di Rete ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Use case: one-way RADOS replication between two clusters by time period
Great information, thanks. I would like to confirm that if I regularly delete older buckets off the LIVE primary system, the extra objects on the ARCHIVE secondaries are ignored during replication. I.e. it does not behave like rsync -avz --delete LIVE/ ARCHIVE/ Rather it behaves more like rsync -avz LIVE/ ARCHIVE/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph counters
I've just started on this myself. I started with https://ceph.com/docs/v0.80/dev/perf_counters/

I'm currently monitoring the latency, using (to pick one example) [op_w_latency][sum] and [op_w_latency][avgcount]. Both values are counters, so they only increase with time. The lifetime average latency of the cluster isn't very useful, so I track the deltas of those values, then divide the recent deltas to get the average latency over my sample period.

Just graphing the latencies let me see a spike in write latency on all disks on one node, which eventually led me to a dead write-cache battery.

That's for the OSDs. I have similar things set up for MON and RadosGW.

I'm sure there are many more useful things to graph. One of the things I'm interested in (but haven't found time to research yet) is the journal usage, with maybe some alerts if the journal is more than 90% full.

On Mon, Oct 13, 2014 at 2:57 PM, Jakes John jakesjohn12...@gmail.com wrote:

Bump :). It would be helpful if someone can share info related to debugging using counters/stats.

On Sun, Oct 12, 2014 at 7:42 PM, Jakes John jakesjohn12...@gmail.com wrote:

Hi All, I would like to know if there are useful performance counters in Ceph which can help to debug the cluster. I have seen hundreds of stat counters in various daemon dumps. Some of them are:
1. commit_latency_ms
2. apply_latency_ms
3. snap_trim_queue_len
4. num_snap_trimming
What do these indicate? I have used iostat and atop for cluster statistics, but none of them indicate the internal Ceph status. Machines might be new but OSDs can still be slow. If some of these counters can help to debug why certain OSDs are bad (or can get bad later), it would be great. Some counters like total processed requests, pending requests in queue, avg time taken to process a request, etc.? Are there any docs for all performance counters which I can read? I couldn't find anything in the Ceph docs. Thanks

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
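In case it helps, this is roughly how the delta calculation can be scripted against the admin socket (a sketch; it assumes the script runs on the OSD host, that "ceph daemon osd.N perf dump" is available there, and that op_w_latency is the usual sum/avgcount pair; adjust the names to whatever your release exposes):

    import json
    import subprocess
    import time

    def op_w_latency(osd_id):
        out = subprocess.check_output(
            ['ceph', 'daemon', 'osd.%d' % osd_id, 'perf', 'dump'])
        lat = json.loads(out.decode())['osd']['op_w_latency']
        return lat['sum'], lat['avgcount']

    prev_sum, prev_cnt = op_w_latency(0)
    time.sleep(60)                        # sample period
    cur_sum, cur_cnt = op_w_latency(0)

    ops = cur_cnt - prev_cnt
    if ops:
        avg_ms = (cur_sum - prev_sum) / ops * 1000.0
        print('avg op_w_latency over the last minute: %.2f ms (%d ops)' % (avg_ms, ops))

The same delta trick applies to the other sum/avgcount counters.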
Re: [ceph-users] Ceph counters
On 10/20/2014 08:22 PM, Craig Lewis wrote: I've just started on this myself.. I started with https://ceph.com/docs/v0.80/dev/perf_counters/ I'm currently monitoring the latency, using the (to pick one example) [op_w_latency][sum] and [op_w_latency][avgcount]. Both values are counters, so they only increase with time. The lifetime average latency of the cluster isn't verify useful, so I track the deltas of those values, then divide the recent deltas to get the average latency over my sample period. Just graphing the latencies let me see a spike in write latency on all disks on one node, which eventually led me to a dead write-cache battery. That's for the OSDs. I have similar things setup for MON and RadosGW. I'm sure there are many more useful things to graph. One of things I'm interested in (but haven't found time to research yet) is the journal usage, with maybe some alerts if the journal is more than 90% full. This is not likely to be an issue with the default journal config since the wbthrottle code is pretty aggressive about flushing the journal to avoid spiky client IO. Having said that, I tend to agree that we need to do a better job of documenting everything from the perf counters to the states described in dump_historic_ops. Even internally it can get confusing trying to keep track of what's going on where. Mark On Mon, Oct 13, 2014 at 2:57 PM, Jakes John jakesjohn12...@gmail.com mailto:jakesjohn12...@gmail.com wrote: Bump:). It would be helpful, if someone can share info related to debugging using counters/stats On Sun, Oct 12, 2014 at 7:42 PM, Jakes John jakesjohn12...@gmail.com mailto:jakesjohn12...@gmail.com wrote: Hi All, I would like to know if there are useful performance counters in ceph which can help to debug the cluster. I have seen hundreds of stat counters in various daemon dumps. Some of them are, 1. commit_latency_ms 2. apply_latency_ms 3. snap_trim_queue_len 4. num_snap_trimming What do these indicate?. . I have used iostat, atop for cluster statistics but, none of them indicate the internal ceph status. Machines might be new but, osds can still be slow. If some of these counters can help to debug why certain osds are bad( or can get bad later), it would be great. Some counters like total processed requests, pending requests in queue, avg time taken to process a request etc ? Are there any docs for all performance counters which I can read?. I couldn't find anything in ceph docs. Thanks ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Use case: one-way RADOS replication between two clusters by time period
In a normal setup, where radosgw-agent runs all the time, it will delete the objects and buckets fairly quickly after they're deleted in the primary zone. If you shut down radosgw-agent, then nothing will update in the secondary cluster. Once you re-enable radosgw-agent, it will eventually process the deletes (along with all the writes). radosgw-agent is a relatively straight-forward python script. It shouldn't be too difficult to ignore the deletes, or write them to a database and process them 6 months later. I'm working on some snapshot capabilities for RadosGW ( https://wiki.ceph.com/Planning/Blueprints/Hammer/rgw%3A_Snapshots). Even if I (or my code) does something really stupid, I'll be able to go back and read the deleted objects from the snapshots. It's not perfect, it won't protect against malicious actions, but it will give me a safety net. On Mon, Oct 20, 2014 at 6:18 PM, Anthony Alba ascanio.al...@gmail.com wrote: Great information, thanks. I would like to confirm that if I regularly delete older buckets off the LIVE primary system, the extra objects on the ARCHIVE secondaries are ignored during replication. I.e. it does not behave like rsync -avz --delete LIVE/ ARCHIVE/ Rather it behaves more like rsync -avz LIVE/ ARCHIVE/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph counters
I'm sure there are many more useful things to graph. One of the things I'm interested in (but haven't found time to research yet) is the journal usage, with maybe some alerts if the journal is more than 90% full.

This is not likely to be an issue with the default journal config since the wbthrottle code is pretty aggressive about flushing the journal to avoid spiky client IO. Having said that, I tend to agree that we need to do a better job of documenting everything from the perf counters to the states described in dump_historic_ops. Even internally it can get confusing trying to keep track of what's going on where. Mark

I've always had issues during deep-scrubbing, particularly when there is a lot of deep-scrubbing going on for a long time. For example, I left nodeep-scrub set for a month. Things were pretty painful when I unset it. Everything was fine, but after ~8 hours I started getting slow requests, then OSDs marked down for being unresponsive. So "full journals" is just my most recent theory. I haven't figured out how to test my theory. I've tested (and fixed) a lot of other issues, which have made things better.

It's less of a problem now with journals on SSD, but it's something I ran into several times when my journals were on the HDD. With the SSD journals, if I do something that affects ~20% of my OSDs, I start having issues. I only have 5 nodes, and I can trigger this by re-formatting all of the OSDs on one node. I haven't (yet) had problems with smaller operations that affect less than 5% of my OSDs. My disks are 4TB, ~70% full, and a fresh format takes 24-48 hours to backfill.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RADOS pool snaps and RBD
On Mon, 20 Oct 2014, Xavier Trilla wrote:

Hi,

It seems Ceph doesn't allow RADOS pool snapshots on RBD pools which have or had RBD snapshots. They only work on RBD pools which never had an RBD snapshot.

So, basically this works:

rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
ceph osd pool mksnap test-pool rados-snap

But this doesn't:

rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
rbd -p test-pool snap create test-image@rbd-snap
ceph osd pool mksnap test-pool rados-snap

And we get the following error message:

Error EINVAL: pool test-pool is in unmanaged snaps mode

I've been checking the source code and it seems to be the expected behavior, but I did not manage to find any information regarding unmanaged snaps mode. Also I did not find any information about RBD snapshots and pool snapshots being mutually exclusive. And even deleting all the RBD snapshots in a pool doesn't enable RADOS snapshots again.

So, I have a couple of questions:

- Are RBD and RADOS snapshots mutually exclusive?

Xinxin already mentioned this, but to confirm, yes.

- What does the "unmanaged snaps mode" message mean?

It means the librados user is managing its own snapshot metadata. In this case, that's RBD; it stores information about what snapshots apply to what images in the RBD header object.

- Is there any way to revert a pool's status to allow RADOS pool snapshots after all RBD snapshots are removed?

No.

We are designing a quite interesting way to perform incremental backups of RBD pools managed by OpenStack Cinder. The idea is to do the incremental backup at the RADOS level, basically using the mtime property of each object and comparing it against the time we did the last backup / pool snapshot. That way it should be really easy to find modified objects and transfer only them, making the implementation of a DR solution easier. But the issue explained here would be a big problem, as the backup solution would stop working if just one user creates an RBD snapshot on the pool (for example using Cinder Backup).

This is already possible using the export-diff and import-diff functions of RBD on a per-image granularity. I think the only thing it doesn't provide is the ability to build a consistency group of lots of images and snapshot them together.

Note also that listing all objects to find the changed ones is not very efficient. The export-diff function is currently also not very efficient (it enumerates image objects), but the 'object map' changes that Jason is working on for RBD will fix this and make it quite fast.

sage

I hope somebody could give us more information about this unmanaged snaps mode or point us to a way to revert this behavior once all RBD snapshots have been removed from a pool.

Thanks!

Best regards,
Xavier Trilla P.
Silicon Hosting

Did you know that at SiliconHosting we now answer your technical questions for free? More information at: siliconhosting.com/qa/

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
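For reference, a minimal per-image sketch of the export-diff / import-diff flow Sage mentions (pool, image and snapshot names are made up; it assumes one snapshot is taken per backup cycle and the previous cycle's snapshot is kept until the diff has been applied on the backup side):

    import subprocess

    POOL = 'volumes'                 # example pool
    IMAGE = 'volume-0001'            # example image
    PREV_SNAP = 'backup-2014-09'     # snapshot taken by the previous cycle
    CUR_SNAP = 'backup-2014-10'      # snapshot for this cycle

    image_spec = '%s/%s' % (POOL, IMAGE)
    diff_path = '/tmp/%s-%s.diff' % (IMAGE, CUR_SNAP)

    # 1. Snapshot the image on the primary cluster for this backup cycle.
    subprocess.check_call(['rbd', 'snap', 'create', '%s@%s' % (image_spec, CUR_SNAP)])

    # 2. Export only the extents that changed between the two snapshots.
    subprocess.check_call(['rbd', 'export-diff', '--from-snap', PREV_SNAP,
                           '%s@%s' % (image_spec, CUR_SNAP), diff_path])

    # 3. On the backup cluster (after shipping diff_path across), apply it:
    # subprocess.check_call(['rbd', 'import-diff', diff_path, image_spec])

Once the diff has been applied, the old snapshot can be dropped and the current one becomes the base for the next cycle.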