[ceph-users] ceph Performance random write is more than sequential
Hi, I have installed a 6-node Ceph cluster and, to my surprise, when I ran rados bench I saw that random write gets better performance numbers than sequential write. This is the opposite of normal disk behaviour. Can somebody let me know if I am missing a point about the Ceph architecture here?
[ceph-users] Question about primary OSD of a pool
Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? -- Den
Re: [ceph-users] Moving a Ceph cluster (to a new network)
Hi Don, I reconfigured the monitors' network recently. My environment is Ceph 0.80.7; OpenStack Icehouse; nova, glance and cinder using Ceph RBD; RHEL 7.0 nodes. The first thing to do is to check that your new network config will allow communication between your MONs (I assume you have 3 mons), with the OSDs and with the Ceph clients. You have to make sure that basic network connectivity works, that port 6789 is open, and that you don't have any MTU issues. Basically I followed this procedure: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
I did the update in one shot, which is disruptive but acceptable in my non-production environment:
- update ceph.conf so that it includes the new subnet(s), and remove the old mon subnets. In my case I have three public subnets in the target config, with routing in between; I updated mon initial members accordingly.
- deploy ceph.conf to the Ceph nodes and to the Ceph client nodes
- build a new monmap
- stop all mons
- inject the new monmap on each node
- start the mon on each node
This should establish the quorum, and all mons should be online. While Ceph was happy with the new config, I then hit a pretty bad issue with OpenStack and RBD which prevented any VM from starting. The reason is that /var/lib/nova/instances/<instance UUID>/libvirt.xml and /etc/libvirt/qemu/<VM name>.xml keep the old IP definition of the monitors. Hard-rebooting the VM does not solve the problem. I solved this by migrating all VMs, which rebuilds the XML file. virsh edit is maybe the better solution; I haven't tried it yet. I will reconfigure the OSD network in a couple of days. HTH Francois
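For reference, the core of the "messy way" in the linked documentation boils down to a handful of commands; the mon names (mon1/mon2/mon3) and addresses below are placeholders, so adapt them to your cluster and follow the docs for the authoritative steps:
ceph mon getmap -o /tmp/monmap            # grab the current monmap while the cluster is still up
monmaptool --rm mon1 --rm mon2 --rm mon3 /tmp/monmap
monmaptool --add mon1 10.0.1.1:6789 --add mon2 10.0.1.2:6789 --add mon3 10.0.1.3:6789 /tmp/monmap
monmaptool --print /tmp/monmap            # sanity-check the new map
# stop all mons, then on each mon host:
ceph-mon -i mon1 --inject-monmap /tmp/monmap
# start the mons again and wait for quorum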
Re: [ceph-users] Question about primary OSD of a pool
Hi, You can verify the exact mapping using the following command:
ceph osd map {poolname} {objectname}
Check http://docs.ceph.com/docs/master/man/8/ceph for the ceph command. Cheers JC
While moving. Excuse unintended typos.
On Feb 1, 2015, at 08:04, Loic Dachary l...@dachary.org wrote: On 01/02/2015 14:47, Dennis Chen wrote: Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? Hi, CRUSH is likely to place john on one OSD and paul on another OSD. Cheers -- Loïc Dachary, Artisan Logiciel Libre
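As a quick illustration of the command JC mentions, assuming a pool named testpool and the two object names from the original question:
ceph osd map testpool john
ceph osd map testpool paul
Each command prints the PG the object maps to and the up/acting OSD sets; the first OSD listed in the acting set is the primary, and the two objects will usually land on different PGs and therefore different primaries.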
[ceph-users] estimate the impact of changing pg_num
Hi folks, I was running a Ceph cluster with 33 OSDs. More recently, 33x6 new OSDs hosted on 33 new servers were added; I have finished rebalancing the data and then marked the 33 old OSDs out. As I now have 6x as many OSDs, I am thinking of increasing pg_num of my largest pool from 1k to at least 8k. What worries me is that this cluster has around 10M objects and is supporting many production VMs with RBD. I am wondering if there is a good way to estimate the amount of data that will be shuffled after I increase pg_num. I want to make sure this can be done within a reasonable amount of time, so that I can declare a proper maintenance window (either overnight, or throughout a weekend). Thanks! -Simon
Re: [ceph-users] erasure code : number of chunks for a small cluster ?
Hi Alexandre, nice to meet you here ;-) With only 3 hosts you can't survive a full node failure, because for that you need hosts >= k + m, and k=1, m=2 doesn't make any sense. I started with 5 hosts and use k=3, m=2. In this case two HDDs can fail, or one host can be down for maintenance. Udo PS: you also can't change k+m on a pool later...
On 01.02.2015 18:15, Alexandre DERUMIER wrote: Hi, I'm currently trying to understand how to correctly set up a pool with erasure code: https://ceph.com/docs/v0.80/dev/osd_internals/erasure_coding/developer_notes/ My cluster is 3 nodes with 6 OSDs per node (18 OSDs total). I want to be able to survive 2 disk failures, but also a full node failure. What is the best setup for this? Do I need M=2 or M=6? Also, how do I determine the best number of chunks? For example: K=4, M=2; K=8, M=2; K=16, M=2. With each config you can lose 2 OSDs, but the more data chunks you have, the less space is used by coding chunks, right? Does the number of chunks have a performance impact (read/write)? Regards, Alexandre
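A minimal sketch of the 5-host, k=3/m=2 layout Udo describes, using firefly-era syntax (the profile and pool names and the PG count are only examples; newer releases spell the failure-domain option crush-failure-domain):
ceph osd erasure-code-profile set k3m2 k=3 m=2 ruleset-failure-domain=host
ceph osd erasure-code-profile get k3m2
ceph osd pool create ecpool 256 256 erasure k3m2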
[ceph-users] cephfs: from a file name determine the objects name
Hi All, CephFS: given a file name, how can one determine the exact location and the names of the objects on the OSDs? So far I understand that the objects' data is stored under the .../current dir on the OSDs, but what naming convention do they use? Many thanks in advance. Thanks Mudit
Re: [ceph-users] ceph Performance random write is more than sequential
Hi Sumit, Do you have cache pool tiering activated? Could you give some feedback about your architecture? Thanks
Sent from my iPad
On 1 févr. 2015, at 15:50, Sumit Gaur sumitkg...@gmail.com wrote: Hi, I have installed a 6-node Ceph cluster and, to my surprise, when I ran rados bench I saw that random write gets better performance numbers than sequential write. This is the opposite of normal disk behaviour. Can somebody let me know if I am missing a point about the Ceph architecture here?
Re: [ceph-users] OSD capacity variance ?
Hi Howard, I assume the 160 and 250 MB are typos (GB, presumably). Ceph OSDs must be at least 10 GB to get a weight of 0.01. Udo
On 31.01.2015 23:39, Howard Thomson wrote: Hi All, I am developing a custom disk storage backend for the Bacula backup system, and am in the process of setting up a trial Ceph system, intending to use a direct interface to RADOS. I have a variety of 1Tb, 250Mb and 160Mb disk drives that I would like to use, but it is not [as yet] obvious whether having differences in capacity at different OSDs matters. Can anyone comment, or point me in the right direction on docs.ceph.com? Thanks, Howard
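Mixing capacities generally works as long as the CRUSH weights reflect the sizes (the usual convention is weight roughly equal to the size in TB). A sketch, with OSD ids and exact weights purely illustrative:
ceph osd crush reweight osd.1 0.91    # ~1 TB drive
ceph osd crush reweight osd.2 0.23    # ~250 GB drive
ceph osd crush reweight osd.3 0.15    # ~160 GB drive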
Re: [ceph-users] estimate the impact of changing pg_num
Hi, I don't know the general calculation, but last week we split a pool with 20 million tiny objects from 512 to 1024 PGs, on a cluster with 80 OSDs. IIRC around 7 million objects needed to move, and it took around 13 hours to finish. The bottleneck in our case was objects per second (limited to around 1000/s), not network throughput (which never exceeded ~50MB/s). It wasn't completely transparent... the time to write a 4kB object increased from 5ms to around 30ms during this splitting process.
I would guess that if you split from 1k to 8k PGs, around 80% of your data will move. Basically, 7 out of 8 objects will be moved to a new primary PG, but any objects that end up with 2nd or 3rd copies on the first 1k PGs should not need to be moved.
I'd also be interested to hear of similar splitting experiences. We've been planning a similar intervention on our larger cluster to move from 4k PGs to 16k. I have been considering making the change gradually (10-100 PGs at a time) instead of all at once. This approach would certainly lower the performance impact, but would take much, much longer to complete. I wrote a short script to perform this gentle splitting here: https://github.com/cernceph/ceph-scripts/blob/master/tools/split/ceph-gentle-split Be sure to understand what it's doing before trying it. Cheers, Dan
On 1 Feb 2015 18:21, Xu (Simon) Chen xche...@gmail.com wrote: Hi folks, I was running a Ceph cluster with 33 OSDs. More recently, 33x6 new OSDs hosted on 33 new servers were added; I have finished rebalancing the data and then marked the 33 old OSDs out. As I now have 6x as many OSDs, I am thinking of increasing pg_num of my largest pool from 1k to at least 8k. What worries me is that this cluster has around 10M objects and is supporting many production VMs with RBD. I am wondering if there is a good way to estimate the amount of data that will be shuffled after I increase pg_num. I want to make sure this can be done within a reasonable amount of time, so that I can declare a proper maintenance window (either overnight, or throughout a weekend). Thanks! -Simon
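For what it's worth, the split itself is just two pool settings: raising pg_num creates the new (split) PGs, and data only starts moving once pgp_num is raised to match. The pool name and target count below are placeholders:
ceph osd pool set <pool> pg_num 2048
# wait for the new PGs to be created and to peer, then:
ceph osd pool set <pool> pgp_num 2048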
[ceph-users] Arbitrary OSD Number Assignment
Hello, In the past we've been able to manually create specific, arbitrary OSD numbers, using this procedure:
1. Add OSD.# to ceph.conf (replicate)
2. Make the necessary dir in /var/lib/ceph/osd/ceph-#
3. Create OSD+journal partitions and filesystems, then mount them
4. Init the data dirs with: ceph-osd -i # --mkfs --mkjournal
5. Create the osd.# auth via keyfile
6. Edit the crushmap, if necessary, and reinject it
7. Execute ceph-osd, as normal
* The last step would create OSD.# in the OSD map, if it did not already exist, while launching the OSD daemon.
** This procedure has also avoided the need to ever run the manual deployment command ceph osd create [uuid].
We have been defining per-host OSD number ranges to quickly identify which host holds an OSD number; this also makes crushmap editing more intuitive and based on easy number patterns. This has worked since pre-Argonaut. It seems that with the newest point release of Firefly, the ceph-osd daemon no longer creates its OSD entry on first launch. Is there a back-door, or a --yes-i-really-mean-it workaround, to accomplish this? Going to sequential OSD number assignment would be **VERY** painful in our workflow. May I suggest adding an optional 2nd param, ceph osd create [uuid] [--osd-num=#], which would do the internal work of verifying uniqueness, creation, and setting max_osd? Best Regards, Ron
Re: [ceph-users] Question about primary OSD of a pool
On 01/02/2015 14:47, Dennis Chen wrote: Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? Hi, CRUSH is likely to place john on one OSD and paul on another OSD. Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] cephfs: from a file name determine the objects name
It's the inode number (in hex), then '.', then the block number (in hex). You can get the inode number of a file with stat. sage
On February 1, 2015 5:08:18 PM GMT+01:00, Mudit Verma mudit.f2004...@gmail.com wrote: Hi All, CephFS: given a file name, how can one determine the exact location and the names of the objects on the OSDs? So far I understand that the objects' data is stored under the .../current dir on the OSDs, but what naming convention do they use? Many thanks in advance. Thanks Mudit
-- Sent from Kaiten Mail. Please excuse my brevity.
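Putting Sage's answer together, a rough recipe; the mount point, file name, inode value and data pool name here are only examples:
stat -c %i /mnt/cephfs/somefile                     # inode number, in decimal
printf '%x\n' $(stat -c %i /mnt/cephfs/somefile)    # same, in hex
rados -p data ls | grep '^100000003ab\.'            # objects are named <inode-hex>.<block-hex>
ceph osd map data 100000003ab.00000000              # PG and OSDs holding the first block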
[ceph-users] erasure code : number of chunks for a small cluster ?
Hi, I'm currently trying to understand how to correctly set up a pool with erasure code: https://ceph.com/docs/v0.80/dev/osd_internals/erasure_coding/developer_notes/ My cluster is 3 nodes with 6 OSDs per node (18 OSDs total). I want to be able to survive 2 disk failures, but also a full node failure. What is the best setup for this? Do I need M=2 or M=6? Also, how do I determine the best number of chunks? For example: K=4, M=2; K=8, M=2; K=16, M=2. With each config you can lose 2 OSDs, but the more data chunks you have, the less space is used by coding chunks, right? Does the number of chunks have a performance impact (read/write)? Regards, Alexandre
Re: [ceph-users] erasure code : number of chunks for a small cluster ?
Hi Alexandre, On 01/02/2015 18:15, Alexandre DERUMIER wrote: Hi, I'm currently trying to understand how to correctly set up a pool with erasure code: https://ceph.com/docs/v0.80/dev/osd_internals/erasure_coding/developer_notes/ My cluster is 3 nodes with 6 OSDs per node (18 OSDs total). I want to be able to survive 2 disk failures, but also a full node failure.
If you have K=2, M=1 you will survive one node failure. If your failure domain is the host (i.e. there is never more than one chunk per node for any given object), it will also survive two disk failures within a given node, because only one of them will have a chunk. It won't be able to resist the simultaneous failure of two OSDs that belong to two different nodes: that would be the same as two simultaneous node failures.
What is the best setup for this? Do I need M=2 or M=6? Also, how do I determine the best number of chunks? For example: K=4, M=2; K=8, M=2; K=16, M=2. With each config you can lose 2 OSDs, but the more data chunks you have, the less space is used by coding chunks, right?
Yes.
Does the number of chunks have a performance impact (read/write)?
If there are more chunks there is an additional computation overhead, but I'm not sure what the impact is. I suspect it's not significant, but I never actually measured it. Cheers
Regards, Alexandre
-- Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] RBD caching on 4K reads???
On Fri, Jan 30, 2015 at 10:09:32PM +0100, Udo Lembke wrote: Hi Bruce, you can also look on the mon, like: ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok config show | grep cache
rbd cache is a client setting, so you have to check it by connecting to the client admin socket. Its location is defined in ceph.conf, [client] section, 'admin socket' parameter. -- Mykola Golub
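A sketch of what that looks like in practice; the socket path pattern is whatever you configure, and the pid/cctid parts of the example file name below are made up:
# ceph.conf on the client host:
# [client]
#     admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
ls /var/run/ceph/                         # find the socket the running client created
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.140221.asok config show | grep rbd_cache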
Re: [ceph-users] estimate the impact of changing pg_num
Hi, You see thousands of slow requests during recovery... does that happen even with single OSD failures? You should be able to recover disks without slow requests. I always run with recovery op priority at the minimum, 1. Tweaking the number of max backfills did not change much during that recent splitting exercise. Which Ceph version are you running? There have been snap-trim-related recovery problems that have only recently been fixed in production releases. 0.80.8 is OK, but I don't know about giant... Cheers, Dan
On 1 Feb 2015 21:39, Xu (Simon) Chen xche...@gmail.com wrote: In my case, each object is 8MB (the glance default for storing images on the RBD backend). RBD doesn't work extremely well when Ceph is recovering: it is common to see hundreds or a few thousand blocked requests (taking 30s or more to finish). This translates into high IO wait inside the VMs, and many applications don't deal with this well. I am not convinced that increasing pg_num gradually is the right way to go. Have you tried giving backfill traffic very low priority? Thanks. -Simon
On Sun, Feb 1, 2015 at 2:39 PM, Dan van der Ster d...@vanderster.com wrote: Hi, I don't know the general calculation, but last week we split a pool with 20 million tiny objects from 512 to 1024 PGs, on a cluster with 80 OSDs. IIRC around 7 million objects needed to move, and it took around 13 hours to finish. The bottleneck in our case was objects per second (limited to around 1000/s), not network throughput (which never exceeded ~50MB/s). It wasn't completely transparent... the time to write a 4kB object increased from 5ms to around 30ms during this splitting process. I would guess that if you split from 1k to 8k PGs, around 80% of your data will move. Basically, 7 out of 8 objects will be moved to a new primary PG, but any objects that end up with 2nd or 3rd copies on the first 1k PGs should not need to be moved. I'd also be interested to hear of similar splitting experiences. We've been planning a similar intervention on our larger cluster to move from 4k PGs to 16k. I have been considering making the change gradually (10-100 PGs at a time) instead of all at once. This approach would certainly lower the performance impact, but would take much, much longer to complete. I wrote a short script to perform this gentle splitting here: https://github.com/cernceph/ceph-scripts/blob/master/tools/split/ceph-gentle-split Be sure to understand what it's doing before trying it. Cheers, Dan
On 1 Feb 2015 18:21, Xu (Simon) Chen xche...@gmail.com wrote: Hi folks, I was running a Ceph cluster with 33 OSDs. More recently, 33x6 new OSDs hosted on 33 new servers were added; I have finished rebalancing the data and then marked the 33 old OSDs out. As I now have 6x as many OSDs, I am thinking of increasing pg_num of my largest pool from 1k to at least 8k. What worries me is that this cluster has around 10M objects and is supporting many production VMs with RBD. I am wondering if there is a good way to estimate the amount of data that will be shuffled after I increase pg_num. I want to make sure this can be done within a reasonable amount of time, so that I can declare a proper maintenance window (either overnight, or throughout a weekend). Thanks! -Simon
[ceph-users] RBD snap unprotect need ACLs on all pools ?
Hi, I have an ACL (key/user) on only 1 pool (client.condor has rwx on pool rbdpartigsanmdev01). I would like to unprotect a snapshot but I get the error below:
rbd -n client.condor snap unprotect rbdpartigsanmdev01/flaprdsvc01_lun003@sync#1.cloneref.2015-02-01.19:07:21
2015-02-01 22:53:00.903790 7f4d0036e760 -1 librbd: can't get children for pool .rgw.root
rbd: unprotecting snap failed: (1) Operation not permitted
After checking the source code (https://github.com/ceph/ceph/blob/master/src/librbd/internal.cc, line 715), it looks like on an unprotect operation Ceph wants to check all the pools of the cluster for children, and we only have access to 1 pool of the cluster... Is it a bug? Or is my usage wrong? Thanks Florent Monthel
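One thing worth checking is the client's caps: for clone-related operations the RBD documentation suggests granting class-read on the rbd_children object prefix in addition to the per-pool rwx. A sketch below; whether this is sufficient for unprotect on your release is not guaranteed, so inspect your current caps first with ceph auth get client.condor:
ceph auth caps client.condor mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbdpartigsanmdev01'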
Re: [ceph-users] estimate the impact of changing pg_num
In my case, each object is 8MB (the glance default for storing images on the RBD backend). RBD doesn't work extremely well when Ceph is recovering: it is common to see hundreds or a few thousand blocked requests (taking 30s or more to finish). This translates into high IO wait inside the VMs, and many applications don't deal with this well. I am not convinced that increasing pg_num gradually is the right way to go. Have you tried giving backfill traffic very low priority? Thanks. -Simon
On Sun, Feb 1, 2015 at 2:39 PM, Dan van der Ster d...@vanderster.com wrote: Hi, I don't know the general calculation, but last week we split a pool with 20 million tiny objects from 512 to 1024 PGs, on a cluster with 80 OSDs. IIRC around 7 million objects needed to move, and it took around 13 hours to finish. The bottleneck in our case was objects per second (limited to around 1000/s), not network throughput (which never exceeded ~50MB/s). It wasn't completely transparent... the time to write a 4kB object increased from 5ms to around 30ms during this splitting process. I would guess that if you split from 1k to 8k PGs, around 80% of your data will move. Basically, 7 out of 8 objects will be moved to a new primary PG, but any objects that end up with 2nd or 3rd copies on the first 1k PGs should not need to be moved. I'd also be interested to hear of similar splitting experiences. We've been planning a similar intervention on our larger cluster to move from 4k PGs to 16k. I have been considering making the change gradually (10-100 PGs at a time) instead of all at once. This approach would certainly lower the performance impact, but would take much much longer to complete. I wrote a short script to perform this gentle splitting here: https://github.com/cernceph/ceph-scripts/blob/master/tools/split/ceph-gentle-split Be sure to understand what it's doing before trying it. Cheers, Dan
On 1 Feb 2015 18:21, Xu (Simon) Chen xche...@gmail.com wrote: Hi folks, I was running a Ceph cluster with 33 OSDs. More recently, 33x6 new OSDs hosted on 33 new servers were added; I have finished rebalancing the data and then marked the 33 old OSDs out. As I now have 6x as many OSDs, I am thinking of increasing pg_num of my largest pool from 1k to at least 8k. What worries me is that this cluster has around 10M objects and is supporting many production VMs with RBD. I am wondering if there is a good way to estimate the amount of data that will be shuffled after I increase pg_num. I want to make sure this can be done within a reasonable amount of time, so that I can declare a proper maintenance window (either overnight, or throughout a weekend). Thanks! -Simon
Re: [ceph-users] estimate the impact of changing pg_num
Hi Xu, On 01.02.2015 21:39, Xu (Simon) Chen wrote: RBD doesn't work extremely well when Ceph is recovering: it is common to see hundreds or a few thousand blocked requests (taking 30s or more to finish). This translates into high IO wait inside the VMs, and many applications don't deal with this well.
This sounds like you don't have settings like:
osd max backfills = 1
osd recovery max active = 1
Udo
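Those can go into the [osd] section of ceph.conf, or be injected into a running cluster; a sketch, with the recovery op priority added as the other knob commonly turned down during recovery:
# ceph.conf
# [osd]
#     osd max backfills = 1
#     osd recovery max active = 1
#     osd recovery op priority = 1
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'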
Re: [ceph-users] ceph Performance random write is more than sequential
Hi Florent, Cache tiering: no.
Our architecture: vdbench/FIO inside VM -> RBD without cache -> Ceph cluster (6 OSD nodes + 3 MONs). Thanks sumit
[root@ceph-mon01 ~]# ceph -s
    cluster 47b3b559-f93c-4259-a6fb-97b00d87c55a
     health HEALTH_WARN clock skew detected on mon.ceph-mon02, mon.ceph-mon03
     monmap e1: 3 mons at {ceph-mon01=192.168.10.19:6789/0,ceph-mon02=192.168.10.20:6789/0,ceph-mon03=192.168.10.21:6789/0}, election epoch 14, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
     osdmap e603: 36 osds: 36 up, 36 in
      pgmap v40812: 5120 pgs, 2 pools, 179 GB data, 569 kobjects
            522 GB used, 9349 GB / 9872 GB avail
                5120 active+clean
On Mon, Feb 2, 2015 at 12:21 AM, Florent MONTHEL fmont...@flox-arts.net wrote: Hi Sumit, Do you have cache pool tiering activated? Could you give some feedback about your architecture? Thanks Sent from my iPad
On 1 févr. 2015, at 15:50, Sumit Gaur sumitkg...@gmail.com wrote: Hi, I have installed a 6-node Ceph cluster and, to my surprise, when I ran rados bench I saw that random write gets better performance numbers than sequential write. This is the opposite of normal disk behaviour. Can somebody let me know if I am missing a point about the Ceph architecture here?
[ceph-users] error opening rbd image
Hello, I can't open an rbd image after a cluster restart. I use the rbd image for a KVM virtual machine. Ceph version 0.87.
uname -a
Linux ceph4 3.14.31-gentoo #1 SMP Fri Jan 30 22:24:11 YEKT 2015 x86_64 Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz GenuineIntel GNU/Linux
rbd info raid0n/homes
rbd: error opening image homes: (6) No such device or address
2015-02-02 07:13:41.334712 7f5e190ff780 -1 librbd: unrecognized header format
2015-02-02 07:13:41.334726 7f5e190ff780 -1 librbd: Error reading header: (6) No such device or address
rados get -p raid0n rbd_directory - | strings
homes
rados get -p raid0n homes - | strings
error getting raid0n/homes: (2) No such file or directory
ceph pg stat
v37728538: 784 pgs: 784 active+clean; 1447 GB data, 3614 GB used, 3186 GB / 6801 GB avail; 21305 B/s rd, 24634 B/s wr, 10 op/s
rbd export raid0n/homes
rbd: error opening image homes: (6) No such device or address
2015-02-02 07:17:19.188832 7f203ad17780 -1 librbd: unrecognized header format
2015-02-02 07:17:19.188844 7f203ad17780 -1 librbd: Error reading header: (6) No such device or address
How can I repair this? Aleksey Leonov
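A first debugging step might be to check whether the image's header object still exists in the pool: a format 1 image keeps its header in an object named <image>.rbd, while a format 2 image uses rbd_id.<image> plus rbd_header.<id>. A sketch against the pool and image named in the message above:
rados -p raid0n stat homes.rbd
rados -p raid0n stat rbd_id.homes
rados -p raid0n ls | grep -E '^rbd_header\.|\.rbd$'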
[ceph-users] OSD can't start After server restart
Ceph version: 0.80.1
Servers: 4
OSDs: 6 disks per server
None of the OSDs on one of the servers can start after that server was restarted, but the OSDs on the other 3 servers can.
[root@dn1 ~]# ceph -s
    cluster 73ceed62-9a53-414b-95dd-61f802251df4
     health HEALTH_WARN 65 pgs stale; 65 pgs stuck stale; 51 requests are blocked 32 sec; pool .rgw.buckets has too few pgs; clock skew detected on mon.1, mon.2
     monmap e1: 3 mons at {0=172.16.0.166:6789/0,1=172.16.0.167:6789/0,2=172.16.0.168:6789/0}, election epoch 628, quorum 0,1,2 0,1,2
     osdmap e513: 24 osds: 18 up, 18 in
      pgmap v707911: 7424 pgs, 14 pools, 513 GB data, 310 kobjects
            9321 GB used, 6338 GB / 16498 GB avail
                  65 stale+active+clean
                7359 active+clean
[root@dn1 ~]# cat /var/log/ceph/osd.6.log
2015-02-02 10:08:11.534384 7f771012f7a0 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 22691
2015-02-02 10:08:11.986865 7f771012f7a0 0 genericfilestorebackend(/cache4/osd.6) detect_features: FIEMAP ioctl is supported and appears to work
2015-02-02 10:08:11.986910 7f771012f7a0 0 genericfilestorebackend(/cache4/osd.6) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-02 10:08:12.612637 7f771012f7a0 0 genericfilestorebackend(/cache4/osd.6) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-02-02 10:08:12.612824 7f771012f7a0 -1 filestore(/cache4/osd.6) Extended attributes don't appear to work. Got error (95) Operation not supported. If you are using ext3 or ext4, be sure to mount the underlying file system with the 'user_xattr' option.
2015-02-02 10:08:12.612942 7f771012f7a0 -1 filestore(/cache4/osd.6) FileStore::mount : error in _detect_fs: (95) Operation not supported
2015-02-02 10:08:12.612964 7f771012f7a0 -1 ** ERROR: error converting store /cache4/osd.6: (95) Operation not supported
[root@dn1 ~]# mount
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/sde1 on /cache4 type ext4 (rw,noatime,user_xattr)
/dev/sdf1 on /cache5 type ext4 (rw,noatime,user_xattr)
/dev/sdg1 on /cache6 type ext4 (rw,noatime,user_xattr)
/dev/sdh1 on /cache7 type ext4 (rw,noatime,user_xattr)
/dev/sdi1 on /cache8 type ext4 (rw,noatime,user_xattr)
/dev/sdj1 on /cache9 type ext4 (rw,noatime,user_xattr)
The disks on the other 3 servers also use ext4 with rw,noatime,user_xattr.
What's the possible reason?
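For ext4 filestores of that era, the documented requirement was to spill xattrs into omap in addition to the user_xattr mount option; a sketch, though whether this is the actual cause here is not certain:
# ceph.conf, [osd] section:
#     filestore xattr use omap = true
# and double-check the mount really carries user_xattr:
mount | grep /cache4
mount -o remount,user_xattr /cache4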
Re: [ceph-users] Question about primary OSD of a pool
On Mon, Feb 2, 2015 at 12:04 AM, Loic Dachary l...@dachary.org wrote: On 01/02/2015 14:47, Dennis Chen wrote: Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? Hi, CRUSH is likely to place john on one OSD and paul on another OSD. Cheers -- Loïc Dachary, Artisan Logiciel Libre
Thanks Loic, correct! I had no cluster at hand to run the test when I asked the question :) -- Den
Re: [ceph-users] Question about primary OSD of a pool
Thanks, I have the answer now, with the 'ceph osd map ...' command.
On Mon, Feb 2, 2015 at 12:50 AM, Jean-Charles Lopez jelo...@redhat.com wrote: Hi, You can verify the exact mapping using the following command: ceph osd map {poolname} {objectname} Check http://docs.ceph.com/docs/master/man/8/ceph for the ceph command. Cheers JC While moving. Excuse unintended typos.
On Feb 1, 2015, at 08:04, Loic Dachary l...@dachary.org wrote: On 01/02/2015 14:47, Dennis Chen wrote: Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? Hi, CRUSH is likely to place john on one OSD and paul on another OSD. Cheers -- Loïc Dachary, Artisan Logiciel Libre
-- Den
Re: [ceph-users] ceph Performance random write is more than sequential
Hi All, What I saw after enabling the RBD cache is that it works as expected, i.e. sequential write has better MB/s than random write. Can somebody explain this behaviour? Is the RBD cache setting a must for a Ceph cluster to behave normally? Thanks sumit
On Mon, Feb 2, 2015 at 9:59 AM, Sumit Gaur sumitkg...@gmail.com wrote: Hi Florent, Cache tiering: no. Our architecture: vdbench/FIO inside VM -> RBD without cache -> Ceph cluster (6 OSD nodes + 3 MONs). Thanks sumit [root@ceph-mon01 ~]# ceph -s cluster 47b3b559-f93c-4259-a6fb-97b00d87c55a health HEALTH_WARN clock skew detected on mon.ceph-mon02, mon.ceph-mon03 monmap e1: 3 mons at {ceph-mon01=192.168.10.19:6789/0,ceph-mon02=192.168.10.20:6789/0,ceph-mon03=192.168.10.21:6789/0}, election epoch 14, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e603: 36 osds: 36 up, 36 in pgmap v40812: 5120 pgs, 2 pools, 179 GB data, 569 kobjects 522 GB used, 9349 GB / 9872 GB avail 5120 active+clean
On Mon, Feb 2, 2015 at 12:21 AM, Florent MONTHEL fmont...@flox-arts.net wrote: Hi Sumit, Do you have cache pool tiering activated? Could you give some feedback about your architecture? Thanks Sent from my iPad
On 1 févr. 2015, at 15:50, Sumit Gaur sumitkg...@gmail.com wrote: Hi, I have installed a 6-node Ceph cluster and, to my surprise, when I ran rados bench I saw that random write gets better performance numbers than sequential write. This is the opposite of normal disk behaviour. Can somebody let me know if I am missing a point about the Ceph architecture here?
Re: [ceph-users] Question about primary OSD of a pool
Hello Dennis, You can create a CRUSH rule to select one OSD as primary, for example:
rule ssd-primary {
        ruleset 5
        type replicated
        min_size 5
        max_size 10
        step take ssd
        step chooseleaf firstn 1 type host
        step emit
        step take platter
        step chooseleaf firstn -1 type host
        step emit
}
The above will select an OSD from the ssd bucket as primary and all secondary OSDs from the platter bucket. You can find more details about the CRUSH map at http://ceph.com/docs/master/rados/operations/crush-map/ . Regards, Sudarshan
On Mon, Feb 2, 2015 at 12:12 PM, Dennis Chen kernel.org@gmail.com wrote: Hi Sudarshan, Any hints on how to do that?
On Mon, Feb 2, 2015 at 1:03 PM, Sudarshan Pathak sushan@gmail.com wrote: BTW, you can make CRUSH always choose the same OSD as primary. Regards, Sudarshan
On Mon, Feb 2, 2015 at 9:26 AM, Dennis Chen kernel.org@gmail.com wrote: Thanks, I have the answer now, with the 'ceph osd map ...' command.
On Mon, Feb 2, 2015 at 12:50 AM, Jean-Charles Lopez jelo...@redhat.com wrote: Hi, You can verify the exact mapping using the following command: ceph osd map {poolname} {objectname} Check http://docs.ceph.com/docs/master/man/8/ceph for the ceph command. Cheers JC While moving. Excuse unintended typos.
On Feb 1, 2015, at 08:04, Loic Dachary l...@dachary.org wrote: On 01/02/2015 14:47, Dennis Chen wrote: Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? Hi, CRUSH is likely to place john on one OSD and paul on another OSD. Cheers -- Loïc Dachary, Artisan Logiciel Libre
-- Den
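If it helps, the usual round trip for getting such a rule into the cluster looks roughly like this; the pool name is a placeholder and the ruleset number must match the rule you added:
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt and add the ssd-primary rule, then:
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
ceph osd pool set <pool> crush_ruleset 5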
Re: [ceph-users] Question about primary OSD of a pool
BTW, you can make CRUSH always choose the same OSD as primary. Regards, Sudarshan
On Mon, Feb 2, 2015 at 9:26 AM, Dennis Chen kernel.org@gmail.com wrote: Thanks, I have the answer now, with the 'ceph osd map ...' command.
On Mon, Feb 2, 2015 at 12:50 AM, Jean-Charles Lopez jelo...@redhat.com wrote: Hi, You can verify the exact mapping using the following command: ceph osd map {poolname} {objectname} Check http://docs.ceph.com/docs/master/man/8/ceph for the ceph command. Cheers JC While moving. Excuse unintended typos.
On Feb 1, 2015, at 08:04, Loic Dachary l...@dachary.org wrote: On 01/02/2015 14:47, Dennis Chen wrote: Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? Hi, CRUSH is likely to place john on one OSD and paul on another OSD. Cheers -- Loïc Dachary, Artisan Logiciel Libre
-- Den
Re: [ceph-users] ceph Performance random write is more than sequential
Sumit, I think random read/write will always outperform sequential read/write in Ceph if you don't have any kind of cache in front or proper striping enabled in the image. The reason is the following:
1. If you are trying with the default image options, the object size is 4 MB, the stripe unit = 4 MB and the stripe count = 1.
2. You didn't mention your write size, but if it is less than 4 MB, 2 sequential writes will usually land on the same object (and thus the same PG) and will be serialized within the OSD.
3. But 2 random writes will (more probably) land on different PGs and be processed in parallel.
4. The same happens for random vs sequential reads. Increasing read_ahead_kb to a reasonably big number will improve the sequential read speed. If you are using librbd, rbd_cache will help you for both read and write, I guess.
5. Another option you may want to try is to set the stripe_size/object_size/stripe_unit to your IO size, so that sequential reads/writes land on different objects; in that case the difference should go away.
Hope this is helpful. Thanks Regards Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sumit Gaur Sent: Sunday, February 01, 2015 6:55 PM To: Florent MONTHEL Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph Performance random write is more than sequential
Hi All, What I saw after enabling the RBD cache is that it works as expected, i.e. sequential write has better MB/s than random write. Can somebody explain this behaviour? Is the RBD cache setting a must for a Ceph cluster to behave normally? Thanks sumit
On Mon, Feb 2, 2015 at 9:59 AM, Sumit Gaur sumitkg...@gmail.com wrote: Hi Florent, Cache tiering: no. Our architecture: vdbench/FIO inside VM -> RBD without cache -> Ceph cluster (6 OSD nodes + 3 MONs). Thanks sumit [root@ceph-mon01 ~]# ceph -s cluster 47b3b559-f93c-4259-a6fb-97b00d87c55a health HEALTH_WARN clock skew detected on mon.ceph-mon02, mon.ceph-mon03 monmap e1: 3 mons at {ceph-mon01=192.168.10.19:6789/0,ceph-mon02=192.168.10.20:6789/0,ceph-mon03=192.168.10.21:6789/0}, election epoch 14, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e603: 36 osds: 36 up, 36 in pgmap v40812: 5120 pgs, 2 pools, 179 GB data, 569 kobjects 522 GB used, 9349 GB / 9872 GB avail 5120 active+clean
On Mon, Feb 2, 2015 at 12:21 AM, Florent MONTHEL fmont...@flox-arts.net wrote: Hi Sumit, Do you have cache pool tiering activated? Could you give some feedback about your architecture? Thanks Sent from my iPad
On 1 févr. 2015, at 15:50, Sumit Gaur sumitkg...@gmail.com wrote: Hi, I have installed a 6-node Ceph cluster and, to my surprise, when I ran rados bench I saw that random write gets better performance numbers than sequential write. This is the opposite of normal disk behaviour. Can somebody let me know if I am missing a point about the Ceph architecture here?
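Following up on Somnath's points 4 and 5, both knobs are client-side; a sketch with purely illustrative values (fancy striping requires format 2 images and is only honoured by librbd, not the kernel client):
# ceph.conf on the client, for librbd caching:
# [client]
#     rbd cache = true
#     rbd cache writethrough until flush = true
# image with 4 MB objects but a 64 kB stripe unit spread across 16 objects:
rbd create testimg --size 10240 --image-format 2 --order 22 --stripe-unit 65536 --stripe-count 16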
Re: [ceph-users] Question about primary OSD of a pool
Hi Sudarshan, Any hints on how to do that?
On Mon, Feb 2, 2015 at 1:03 PM, Sudarshan Pathak sushan@gmail.com wrote: BTW, you can make CRUSH always choose the same OSD as primary. Regards, Sudarshan
On Mon, Feb 2, 2015 at 9:26 AM, Dennis Chen kernel.org@gmail.com wrote: Thanks, I have the answer now, with the 'ceph osd map ...' command.
On Mon, Feb 2, 2015 at 12:50 AM, Jean-Charles Lopez jelo...@redhat.com wrote: Hi, You can verify the exact mapping using the following command: ceph osd map {poolname} {objectname} Check http://docs.ceph.com/docs/master/man/8/ceph for the ceph command. Cheers JC While moving. Excuse unintended typos.
On Feb 1, 2015, at 08:04, Loic Dachary l...@dachary.org wrote: On 01/02/2015 14:47, Dennis Chen wrote: Hello, If I write 2 different objects, e.g. john and paul, to the same pool (say testpool) in the cluster, is the primary OSD calculated by CRUSH the same for the 2 objects? Hi, CRUSH is likely to place john on one OSD and paul on another OSD. Cheers -- Loïc Dachary, Artisan Logiciel Libre
-- Den