Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes
Thanks, Robert, for sharing so much experience! I feel like I don't deserve it :) I have another, very similar situation which I don't understand. Last time I tried to hard-kill OSD daemons. This time I added a new node with 2 OSDs to my cluster while monitoring the IO. I wrote a script which adds a node with OSDs fully automatically, and it seems that when I start the script, IO is blocked until the cluster shows HEALTH_OK, which takes quite a long time. Once the Ceph status is OK, copying resumes. What should I tune this time to avoid the long IO interruption? Thanks in advance again :)
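A hedged sketch of the tunables usually involved in keeping client IO responsive while OSDs are added (these are the standard Ceph recovery/backfill options; the values are only illustrative, and osd.<id> is a placeholder):

# throttle backfill/recovery so client IO is not starved:
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# alternatively, bring a new OSD in at zero CRUSH weight and ramp it up in small steps:
ceph osd crush reweight osd.<id> 0.0
ceph osd crush reweight osd.<id> 0.5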
Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes
Can you provide the output of the CRUSH map and a copy of the script that you are using to add the OSDs? Can you also provide the pool size and pool min_size?

Robert LeBlanc

On Thu, May 14, 2015 at 6:33 AM, Vasiliy Angapov anga...@gmail.com wrote: [...]
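For reference, the information Robert asks for can be gathered with standard commands (a sketch; <pool> is a placeholder for the pool backing the RBD images):

ceph osd getcrushmap -o crushmap.bin       # dump the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt  # decompile it to readable text
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size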
Re: [ceph-users] Cisco UCS Blades as MONs? Pros cons ...?
I have 42 OSDs on 6 servers. I'm planning to double that this quarter by adding 6 more servers to get to 84 OSDs. I have 3 monitor VMs. Two of them are running on two different blades in the same chassis, but their networking is on different fabrics. The third one is on a blade in a different chassis. My monitor VM CPU, memory and disk IO load is very small, as in nearly idle. The VM images are on local 10k disks on the blade. They share the disks with a few other low-IO VMs. I've read that the monitors can get busy and need a lot of IO, to the point where it justifies using SSDs. I imagine those must be very large clusters with at least hundreds of OSDs. Jake

On Wednesday, May 13, 2015, Götz Reinicke - IT Koordinator goetz.reini...@filmakademie.de wrote: Hi Jake, we have the fabric interconnects. MONs as VMs? What setup do you have, and what cluster size? Regards, Götz

On 13.05.15 at 15:20, Jake Young wrote: I run my mons as VMs inside of UCS blade compute nodes. Do you use the fabric interconnects or the standalone blade chassis? Jake

On Wednesday, May 13, 2015, Götz Reinicke - IT Koordinator goetz.reini...@filmakademie.de wrote: Hi Christian, currently we do get good discounts as a university and the bundles were worth it. The chassis have multiple PSUs and n 10Gb ports (40Gb is possible). The switch connection is redundant. Currently we are thinking of 10 SATA OSD nodes + x SSD cache pool nodes and 5 MONs, for a start. The main focus with the blades would be saving space in the rack. So far I don't have any pricing, but that would also count in our decision :) Thanks and regards, Götz -- Götz Reinicke, IT-Koordinator, Filmakademie Baden-Württemberg GmbH
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
On 14/05/2015 18:15, Francois Lafont wrote: [...]

Greg's response is pretty comprehensive, but for completeness I'll add that the specific case of shutdown blocking is http://tracker.ceph.com/issues/9477

Cheers, John
[ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
Hi, I had a problem with a cephfs freeze on a client. Impossible to re-enable the mountpoint: a simple "ls /mnt" command blocked completely (and of course it was impossible to umount/remount etc.), so I had to reboot the host. But even a normal reboot didn't work; the host didn't stop, and I had to do a hard reboot. In brief, it was like a big NFS freeze. ;)

In the logs, nothing relevant on the client side, and just this line on the cluster side:

~# cat /var/log/ceph/ceph-mds.1.log
[...]
2015-05-14 17:07:17.259866 7f3b5cffc700 0 log_channel(cluster) log [INF] : closing stale session client.1342358 192.168.21.207:0/519924348 after 301.329013
[...]

And indeed, the freeze was probably triggered by a little network interruption. Here is my configuration:

- OS: Ubuntu 14.04 on the client and on the cluster nodes.
- Kernel: 3.16.0-36-generic on the client and on the cluster nodes (apt-get install linux-image-generic-lts-utopic).
- Ceph version: Hammer on the client and on the cluster nodes (0.94.1-1trusty).

On the client, I use the cephfs kernel module (not ceph-fuse). Here is the fstab line on the client node:

10.0.2.150,10.0.2.151,10.0.2.152:/ /mnt ceph noatime,noacl,name=cephfs,secretfile=/etc/ceph/secret,_netdev 0 0

My only configuration concerning the MDS in ceph.conf is:

mds cache size = 100

That's all. Here are my questions:

1. Is this kind of freeze normal? Can I avoid these freezes with a more recent kernel version on the client?
2. Can I avoid these freezes with ceph-fuse instead of the kernel cephfs module? But in this case, the cephfs performance will be worse. Am I wrong?
3. Is there a parameter in ceph.conf to tell the MDS to be more patient before closing the stale session of a client?

I'm in a testing period and a hard reboot of my cephfs clients would be quite annoying for me. Thanks in advance for your help. -- François Lafont
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
On Thu, May 14, 2015 at 10:15 AM, Francois Lafont flafdiv...@free.fr wrote: [...]

1. Is this kind of freeze normal? Can I avoid these freezes with a more recent version of the kernel in the client?

Yes, it's normal. Although you should have been able to do a lazy and/or force umount. :) You can't avoid the freeze with a newer client. :( If you notice the problem quickly enough, you should be able to reconnect everything by rebooting the MDS — although if the MDS hasn't failed the client then things shouldn't be blocking, so actually that probably won't help you.

2. Can I avoid these freezes with ceph-fuse instead of the kernel cephfs module? But in this case, the cephfs performance will be worse. Am I wrong?

No, ceph-fuse will suffer the same blockage, although obviously in userspace it's a bit easier to clean up. Depending on your workload it will be slightly faster to a lot slower. Though you'll also get updates faster/more easily. ;)

3. Is there a parameter in ceph.conf to tell the MDS to be more patient before closing the stale session of a client?

Yes. You'll need to increase the mds session timeout value on the MDS; it currently defaults to 60 seconds. You can increase it to whatever value you like. The tradeoff here is that if a client dies, anything it had capabilities on (for read/write access) will be unavailable to anybody doing something that might conflict with those capabilities. If you've got a new enough MDS (Hammer, probably, but you can check) then you can use the admin socket to boot specific sessions, so it may suit you to set very large timeouts and manually zap any client which actually goes away badly (rather than getting disconnected by the network).

I'm in a testing period and a hard reboot of my cephfs clients would be quite annoying for me. Thanks in advance for your help.

Yeah. Unfortunately there's a basic tradeoff in strictly-consistent (aka POSIX) network filesystems here: if the network goes away, you can't be consistent any more because the disconnected client can make conflicting changes. And you can't tell exactly when the network disappeared.
So while we hope to make this less painful in the future, the network dying that badly is a failure case you need to be aware of, meaning that the client might have conflicting information. If it *does* have conflicting info, the best we can do about it is be polite, return a bunch of error codes, and unmount gracefully. We'll get there eventually but it's a lot of work. -Greg
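A hedged sketch of the configuration and admin-socket workflow Greg describes (the option name is the standard one; the session commands may vary by release, as he notes, and <id> / <session-id> are placeholders):

# ceph.conf on the MDS node, under [mds]: raise the 60s default
mds session timeout = 300

# list client sessions via the admin socket, then evict a dead one:
ceph daemon mds.<id> session ls
ceph daemon mds.<id> session evict <session-id>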
Re: [ceph-users] export-diff exported only 4kb instead of 200-600gb
Interesting. The 'rbd diff' operation uses the same librbd API method as 'rbd export-diff' to calculate all the updated image extents, so it's very strange that one works and the other doesn't given that you have a validly formatted export. I tried to recreate your issue on Giant and was unable to. I would normally ask for a log dump with 'debug rbd = 20', but given the size of your image, that log would be astronomically large.

-- Jason Dillaman, Red Hat, dilla...@redhat.com, http://www.redhat.com

----- Original Message -----
From: Ultral ultral...@gmail.com
To: Jason Dillaman dilla...@redhat.com
Cc: ceph-users ceph-us...@ceph.com
Sent: Tuesday, May 12, 2015 12:15:27 PM
Subject: Re: [ceph-users] export-diff exported only 4kb instead of 200-600gb

If you run 'rbd info --pool RBD-01 CEPH_006__01__NA__0003__ESX__ALL_EXT', what is the output?

size 2048 GB in 524288 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.19b1.238e1f29
format: 1

Does 'rbd diff' work against the image (i.e. more than a few kilobytes of deltas)?

It looks fine:

time rbd diff --cluster cluster1 --pool NETAP-RBD-01 CEPH_006__01__NA__0003__ESX__ALL_EXT | wc -c
14593264
real 22m35.316s
user 2m39.537s
sys 1m24.177s

Also, would it be possible for you to create a new, test image in the same pool, snapshot it, use 'rbd bench-write' to generate some data, and then verify if export-diff is properly working against the new image?

I will try... I can only create a 1-100GB image in this pool.

2015-05-12 19:30 GMT+05:00 Jason Dillaman dilla...@redhat.com: Very strange. I'll see if I can reproduce on a Giant release. If you run 'rbd info --pool RBD-01 CEPH_006__01__NA__0003__ESX__ALL_EXT', what is the output? I want to use the same settings as your image. [...]

-- Jason Dillaman, Red Hat, dilla...@redhat.com, http://www.redhat.com

----- Original Message -----
From: Ultral ultral...@gmail.com
To: Jason Dillaman dilla...@redhat.com
Cc: ceph-users ceph-us...@ceph.com
Sent: Sunday, May 10, 2015 5:40:00 AM
Subject: Re: [ceph-users] export-diff exported only 4kb instead of 200-600gb

Hello Jason, "but to me it sounds like you are saying that there are no/minimal deltas between snapshots move2db24-20150428 and 2015-05-05 (both from the export-diff and from your clone)." Yep, that's correct: the difference between snapshots move2db24-20150428 and 2015-05-05 is too small, 4kb instead of 200-800gb.

"Are you certain that you made 700-800GBs of changes between the two snapshots and no trim operations released your changes back?" The VM is located on the image; it is an intranet for 1000 people, running web+mysql+sphinx+backups. The vast majority of the changed data is backups (2-day rotation) inside the VM on the image, which produce about 200gb of data each day. We also store user uploads (0.3-3gb per day) and databases (about 30gb), so I suppose the changes should be more than 4kb.

"If you diff from move2db24-20150428 to HEAD, do you see all your changes?"

rbd export-diff --cluster cluster1 --pool RBD-01 CEPH_006__01__NA__0003__ESX__ALL_EXT --from-snap move2db24-20150428 - | wc -c
6786
Exporting image: 100% complete...done.

It is too small...
I've added some video files to the VM; however, it still shows only 6kb.

2015-05-08 18:43 GMT+05:00 Jason Dillaman dilla...@redhat.com: There is probably something that I am not understanding, but to me it sounds like you are saying that there are no/minimal deltas between snapshots move2db24-20150428 and 2015-05-05 (both from the export-diff and from your clone). Are you certain that you made 700-800GBs of changes between the two snapshots and no trim operations released your changes back? If you diff from move2db24-20150428 to HEAD, do you see all your changes?

-- Jason Dillaman, Red Hat, dilla...@redhat.com, http://www.redhat.com

----- Original Message -----
From: Ultral ultral...@gmail.com
To: ceph-users ceph-us...@ceph.com
Sent: Thursday, May 7, 2015 11:45:46 AM
Subject: [ceph-users] export-diff exported only 4kb instead of 200-600gb

Hi all, something strange has occurred. I have Ceph version 0.87 and a 2048gb format 1 image. I decided to make incremental backups between clusters. I made the initial copy:

time bbcp -x 7M -P 3600 -w 32M -s 6 -Z 5030:5035 -N io rbd export-diff --cluster cluster1 --pool RBD-01 --image CEPH_006__01__NA__0003__ESX__ALL_EXT --snap move2db24-20150428 - 1.1.1.1:rbd import-diff - --cluster cluster2 --pool TST-INT-SD-RBD-1DC --image temp

and decided to move the incremental (it should be about 200-600gb of changes):

time bbcp -c -x 7M -P 3600 -w 32M -s 6 -Z 5030:5035 -N io rbd --cluster
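Jason's suggested verification could look something like this (a sketch; the image name is made up, and the sizes and bench-write arguments are only illustrative):

rbd create --pool RBD-01 --size 10240 diff-test
rbd snap create RBD-01/diff-test@snap1
rbd bench-write RBD-01/diff-test --io-total 1073741824               # write ~1GB into the image
rbd snap create RBD-01/diff-test@snap2
rbd export-diff --from-snap snap1 RBD-01/diff-test@snap2 - | wc -c   # expect ~1GB of deltas, not a few kb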
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
On Thu, May 14, 2015 at 2:47 PM, John Spray john.sp...@redhat.com wrote: Greg's response is pretty comprehensive, but for completeness I'll add that the specific case of shutdown blocking is http://tracker.ceph.com/issues/9477

I've seen the same thing before with /dev/rbd mounts when the network temporarily goes away - client had to be rebooted. Is this likely to be the same underlying issue? Lee
Re: [ceph-users] rados cppool
On 2015-05-14 21:04:06 +0000, Daniel Schneller said: [...]

Never mind, found more information on this on the list a few posts later.
[ceph-users] ceph -w output
Hi! I am trying to understand the values in ceph -w, especially those regarding throughput(?) at the end:

2015-05-15 00:54:33.333500 mon.0 [INF] pgmap v26048646: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 6023 kB/s rd, 549 kB/s wr, 7564 op/s
2015-05-15 00:54:34.339739 mon.0 [INF] pgmap v26048647: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 1853 kB/s rd, 1014 kB/s wr, 2015 op/s
2015-05-15 00:54:35.353621 mon.0 [INF] pgmap v26048648: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 2101 kB/s rd, 1680 kB/s wr, 1950 op/s
2015-05-15 00:54:36.375887 mon.0 [INF] pgmap v26048649: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 1641 kB/s rd, 1266 kB/s wr, 1710 op/s
2015-05-15 00:54:37.399647 mon.0 [INF] pgmap v26048650: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 4735 kB/s rd, 777 kB/s wr, 7088 op/s
2015-05-15 00:54:38.453922 mon.0 [INF] pgmap v26048651: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 5176 kB/s rd, 942 kB/s wr, 7779 op/s
2015-05-15 00:54:39.462838 mon.0 [INF] pgmap v26048652: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 3407 kB/s rd, 768 kB/s wr, 2131 op/s
2015-05-15 00:54:40.488387 mon.0 [INF] pgmap v26048653: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 3343 kB/s rd, 518 kB/s wr, 1881 op/s
2015-05-15 00:54:41.512540 mon.0 [INF] pgmap v26048654: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 1221 kB/s rd, 2385 kB/s wr, 1686 op/s

Am I right to assume the values for kB/s rd and kB/s wr mean that the indicated amount of data has been read/written by clients since the last line, totalled over all OSDs?

As for the op/s I am a little more uncertain. What kind of operations does this count? Assuming it also aggregates reads and writes, what counts as an operation? For example, when I request data via the Rados Gateway, do I see one op here for the request from RGW's perspective, or do I see multiple, depending on how many low-level objects a big RGW upload was striped to? What about non-RGW objects that get striped? Are reads/writes on those counted as one, or as one per stripe? Is there anything else counted here besides reads/writes of object data? What about key/value level accesses?

Is it possible to somehow come up with a theoretical estimate of the maximum value achievable with a given set of hardware? This is a cluster of 4 nodes with 48 OSDs, 4TB each, all spinners. Are these values good, bad, critical? Can I somehow deduce - even if it is just a rather rough estimate - how loaded my cluster is? I am not talking about precision monitoring, but some kind of traffic light system (e.g. up to X% of the theoretical max is fine, up to Y% shows a very busy cluster, and anything above Y% means we might be up for trouble)?

Any pointers to documentation or other material would be appreciated if this was discussed in some detail before. The only thing I found was a post on this list from 2013 which did not say more than "ops are reads, writes, anything", without going into detail about the "anything". Thanks a lot! Daniel
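A hedged aside on viewing the same counters in other forms (standard CLI commands; output details vary by release):

ceph osd pool stats    # per-pool client IO: op/s plus read/write bandwidth
ceph status            # one-shot summary of the same pgmap line that ceph -w streams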
Re: [ceph-users] rados cppool
On 2015-04-23 19:39:33 +0000, Sage Weil said:

On Thu, 23 Apr 2015, Pavel V. Kaygorodov wrote: Hi! I have copied two of my pools recently, because the old ones had too many PGs. Both of them contain RBD images, with 1GB and ~30GB of data. Both pools were copied without errors; the RBD images are mountable and seem to be fine. Ceph version is 0.94.1

You will likely have problems if you try to delete snapshots that existed on the images (snaps are not copied/preserved by cppool). sage

Could you be more specific on what these problems would look like? Are you referring to RBD pools in particular, or is this a general issue with snapshots? Anything that could be done to prevent these issues? Background of the question is that we take daily snapshots of some pools to allow reverting data when users make mistakes (via RGW). So it would be difficult to get rid of all snapshots first. Thanks, Daniel
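A hedged way to inventory snapshots before attempting a cppool (standard commands; <pool> and <image> are placeholders):

rados -p <pool> lssnap        # pool-level snapshots (self-managed RBD snaps will not appear here)
rbd snap ls <pool>/<image>    # per-image RBD snapshots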
Re: [ceph-users] Firefly to Hammer
You should be able to do just that. We recently upgraded from Firefly to Hammer like that. Follow the order described on the website: monitors, OSDs, MDSs. Note that the Debian packages do not restart running daemons, but they _do_ start up daemons that are not running. So if, say, you had shut down OSDs before your upgrade for some reason, they would be started as part of the upgrade. Daniel
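A hedged sketch of the rolling restart this implies (assuming the sysvinit scripts these releases ship with; exact service invocations differ by distro and init system):

# one monitor node at a time:
service ceph restart mon
ceph quorum_status    # confirm the mon rejoined before moving to the next node

# then one OSD node at a time:
service ceph restart osd
ceph health           # wait for HEALTH_OK before the next node

# finally the MDS nodes:
service ceph restart mds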
[ceph-users] ceph-deploy osd activate ERROR
Hi, I encountered some other problems when I installed Ceph.

#1. When I run the command "ceph-deploy new ceph-0", I get the ceph.conf file below. However, there is no information in it about "osd pool default size" or "public network":

[root@ceph-2 my-cluster]# more ceph.conf
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 192.168.72.33
mon_initial_members = ceph-0
fsid = 74d682b5-2bf2-464c-8462-740f96bcc525

#2. I ignored problem #1 and continued to set up the Ceph Storage Cluster, and encountered an error when running the command 'ceph-deploy osd activate ceph-2:/mnt/sda'. I did this following the manual at http://ceph.com/docs/master/start/quick-ceph-deploy/

Error message:

[root@ceph-0 my-cluster]# ceph-deploy osd prepare ceph-2:/mnt/sda
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.23): /usr/bin/ceph-deploy osd prepare ceph-2:/mnt/sda
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks ceph-2:/mnt/sda:
[ceph-2][DEBUG ] connected to host: ceph-2
[ceph-2][DEBUG ] detect platform information from remote host
[ceph-2][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.5 Final
[ceph_deploy.osd][DEBUG ] Deploying osd to ceph-2
[ceph-2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph-2][INFO ] Running command: udevadm trigger --subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host ceph-2 disk /mnt/sda journal None activate False
[ceph-2][INFO ] Running command: ceph-disk -v prepare --fs-type xfs --cluster ceph -- /mnt/sda
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[ceph-2][WARNIN] DEBUG:ceph-disk:Preparing osd data dir /mnt/sda
[ceph-2][INFO ] checking OSD status...
[ceph-2][INFO ] Running command: ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host ceph-2 is now ready for osd use.
Error in sys.exitfunc:

[root@ceph-0 my-cluster]# ceph-deploy osd activate ceph-2:/mnt/sda
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.23): /usr/bin/ceph-deploy osd activate ceph-2:/mnt/sda
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks ceph-2:/mnt/sda:
[ceph-2][DEBUG ] connected to host: ceph-2
[ceph-2][DEBUG ] detect platform information from remote host
[ceph-2][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.5 Final
[ceph_deploy.osd][DEBUG ] activating host ceph-2 disk /mnt/sda
[ceph_deploy.osd][DEBUG ] will use init type: sysvinit
[ceph-2][INFO ] Running command: ceph-disk -v activate --mark-init sysvinit --mount /mnt/sda
[ceph-2][WARNIN] DEBUG:ceph-disk:Cluster uuid is af23707d-325f-4846-bba9-b88ec953be80
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[ceph-2][WARNIN] DEBUG:ceph-disk:Cluster name is ceph
[ceph-2][WARNIN] DEBUG:ceph-disk:OSD uuid is ca9f6649-b4b8-46ce-a860-1d81eed4fd5e
[ceph-2][WARNIN] DEBUG:ceph-disk:Allocating OSD id...
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd create --concise ca9f6649-b4b8-46ce-a860-1d81eed4fd5e
[ceph-2][WARNIN] 2015-05-14 17:37:10.988914 7f373bd34700 0 librados: client.bootstrap-osd authentication error (1) Operation not permitted
[ceph-2][WARNIN] Error connecting to cluster: PermissionError
[ceph-2][WARNIN] ceph-disk: Error: ceph osd create failed: Command '/usr/bin/ceph' returned non-zero exit status 1:
[ceph-2][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init sysvinit --mount /mnt/sda
Error in sys.exitfunc:

I look forward to hearing from you soon.
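The "client.bootstrap-osd authentication error (1) Operation not permitted" usually means the bootstrap-osd keyring on the OSD node does not match what the monitors have registered. A hedged set of checks (standard commands; hostnames taken from the thread):

# on ceph-2, inspect the key that ceph-disk is using:
cat /var/lib/ceph/bootstrap-osd/ceph.keyring

# on a node with an admin key, compare it with the registered key:
ceph auth get client.bootstrap-osd

# if they differ, re-collect the keys on the admin node (this writes
# ceph.bootstrap-osd.keyring to the current directory), then copy it into
# place on ceph-2 as /var/lib/ceph/bootstrap-osd/ceph.keyring and retry:
ceph-deploy gatherkeys ceph-0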
Re: [ceph-users] How to debug a ceph read performance problem?
Hi,

1. The network problem has been partly resolved; we removed the bonding on the Juno node (the Ceph client side), and now IO comes back:

[root@controller fio-rbd]# rados bench -p test 30 seq
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -          0
    1      16       176       160     639.7       640  0.186673  0.0933836
    2      16       339       323   645.795       652  0.079945  0.0965533
    3      16       509       493   657.153       680   0.06882  0.0957288
    4      16       672       656   655.837       652  0.068071  0.0963944
    5      16       828       812    649.45       624  0.061999  0.0975488
    6      16       989       973   648.513       644  0.110632  0.0979637
    7      16      1139      1123   641.565       600  0.078144  0.0983299
    8      16      1295      1279   639.349       624  0.243684  0.0991592
    9      16      1453      1437   638.522       632   0.08775  0.0993148
   10      16      1580      1564   625.461       508  0.061375   0.101921

The bonding is built from interfaces em1 and em2; the problematic interface is em2. Traffic from some storage nodes to em2 is quite good, but some is not. We still don't know the exact issue, but it is surely a network problem.

2. About the monitors:
- The monitors have not been restarted for at least half a year.
- "ceph tell mon.bj-ceph14 compact" just hangs until I Ctrl+C, and the same happens for the other monitor nodes.
- /var/lib/ceph/mon shares the Linux system disk (RAID1 of two HDDs).
I will go through Google and the mailing list later.

3. About memory: yes, I got that wrong. I will spend some time with atop :-)

4. A single CPU with 4 cores, without hyperthreading. So the CPU needs to be upgraded, the OSD count per Ceph node should be reduced (sparing some CPU power for more SSDs), and more SSD journal disks added. Also, I am planning to upgrade the OSD data disks from 1TB to 4TB. I will look through the mails about the ratio of OSD data disks to journal disks, and the space and performance requirements for the journal SSDs.

2015-05-14 10:39 GMT+08:00 Christian Balzer ch...@gol.com:

Hello,

On Thu, 14 May 2015 09:36:14 +0800 changqian zuo wrote:

1. No packet drops found in the system log.

Is that storage node with the bad network fixed?

2. ceph health detail shows:

# ceph health detail
HEALTH_WARN mon.bj-ceph10 addr 10.10.11.23:6789/0 has 43% avail disk space -- store is getting too big! 77364 MB >= 40960 MB
mon.bj-ceph12 addr 10.10.11.25:6789/0 has 43% avail disk space -- store is getting too big! 77071 MB >= 40960 MB
mon.bj-ceph13 addr 10.10.11.26:6789/0 has 42% avail disk space -- store is getting too big! 78403 MB >= 40960 MB
mon.bj-ceph14 addr 10.10.11.27:6789/0 has 43% avail disk space -- store is getting too big! 78006 MB >= 40960 MB

I am checking what exactly this means.

You will find a lot of answers searching for "compact mon storage", including a very recent thread here. In short, I suppose those monitors have not been restarted for a long time, right? Also, you have a pretty big cluster, so this isn't all that surprising. I'd suggest doing a "ceph tell mon.bj-ceph14 compact" and, if that works out well, repeating it with the others. Are your MONs using SSDs, for /var/lib/ceph in particular?

3. By "run out", I mean:

# free -m
             total       used       free     shared    buffers     cached
Mem:         64376      63766        609          0        123      47974
-/+ buffers/cache:      15669      48707
Swap:        22831       2319      20512

That doesn't look too bad, only ~16GB used by processes; the rest is cache and friends. However, during recovery this usage will get higher, and Ceph also benefits from large pagecaches for hot object reads. So doubling the memory might be a good long-term goal.

top shows that memory is mainly used by the ceph-osd processes.

You might want to spend some time learning atop, as it will show you what is going on in every part of your system (a huge terminal window helps).

4.
Cluster configuration, for a single Ceph node:

CPU: Intel(R) Xeon(R) CPU E5-2603 v2 @ 1.80GHz

Single CPU??? That's optimistically 8GHz of CPU, when the recommendation for purely HDD-based OSDs is 1GHz per OSD. Since you're using SSD journals, you will want double that so as not to be CPU-limited for many use cases.

Memory: 64GB
Data disks: 1TB HDD * 24 (do not know the vendor right now)
Journal disks: 800GB SSD * 2 (do not know the vendor right now)

We have run 24 OSDs on one node! I think this is why the memory is in shortage (also the CPU may not cope with a high-load recovery or rebalance, and 2 SSDs for 24 OSD journals is just not enough) and why slow OSD writes are logged; if reduced to 16 or 12, it would be much better.

You're likely to be CPU-bound a lot of the time during normal operations (provided everything else in your cluster is working correctly). Depending on the type of SSD they should be fast enough to
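Since "ceph tell mon.<id> compact" hangs here, a hedged alternative is compaction at startup (mon_compact_on_start is a standard monitor option; restart one monitor at a time so quorum is kept):

# ceph.conf on the monitor nodes:
[mon]
mon compact on start = true

# then restart each mon daemon in turn and verify quorum before the next:
service ceph restart mon
ceph quorum_status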
Re: [ceph-users] Find out the location of OSD Journal
I tend to use something along the lines of:

for osd in $(grep osd /etc/mtab | cut -d ' ' -f 2); do echo $(echo $osd | cut -d '-' -f 2): $(readlink -f $(readlink $osd/journal)); done | sort -k 2

Cheers, Josef

On 08 May 2015, at 02:47, Robert LeBlanc rob...@leblancnet.us wrote: You may also be able to use `ceph-disk list`.

On Thu, May 7, 2015 at 3:56 AM, Francois Lafont flafdiv...@free.fr wrote: Hi, Patrik Plank wrote: "i can't remember on which drive I installed which OSD journal :-|| Is there any command to show this?" It's probably not the answer you hoped for, but why not use a simple: ls -l /var/lib/ceph/osd/ceph-$id/journal ? -- François Lafont
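A more readable equivalent of Josef's one-liner (a sketch; it assumes OSD data directories are mounted as .../ceph-<id> and that each journal is a symlink, as in a standard ceph-disk layout):

for mnt in $(grep osd /etc/mtab | cut -d ' ' -f 2); do
    id=${mnt##*-}                           # OSD id taken from the mount point name
    journal=$(readlink -f "$mnt/journal")   # resolve the journal symlink to the real device
    echo "osd.$id: $journal"
done | sort -k 2                            # sort by journal device to group OSDs sharing one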