Re: [ceph-users] full osd ssd cluster advise : replication 2x or 3x ?
BTW, the new Samsung PM853T SSD announces 665 TBW for 4K random write
(http://www.tomsitpro.com/articles/samsung-3-bit-nand-enterprise-ssd,1-1922.html)
and the price is cheaper than the Intel S3500 (around 450€ ex VAT).
(The cluster will be built next year, so I have some time to choose the
right SSD.) My main concern is to know whether replication x3 is really
needed (mainly because of cost). But I can wait for lower SSD prices next
year and go to 3x if necessary.

----- Original Message -----
From: Alexandre DERUMIER aderum...@odiso.com
To: Christian Balzer ch...@gol.com
Cc: ceph-users@lists.ceph.com
Sent: Friday, 23 May 2014 07:59:58
Subject: Re: [ceph-users] full osd ssd cluster advise : replication 2x or 3x ?

> That's not the only thing you should worry about. Aside from the higher
> risk there's total cost of ownership, or cost per terabyte written
> ($/TBW). So while the DC S3700 800GB is about $1800 and the same sized
> DC S3500 is about $850, the 3700 can reliably store 7300TB while the
> 3500 is only rated for 450TB. You do the math. ^.^

Yes, I know, I have already done the math. But I'm far from reaching that
amount of writes. The workload is (really) random, so 20% writes out of
30,000 iops at 4k blocks = 25MB/s of writes, 2TB each day. With
replication 3x, that is 6TB of writes each day.

60 x 450TBW = 27000TBW / 6TB per day = 4500 days = 12.5 years ;)

With the journal writes it is of course less, but I think it should be
enough for 5 years.

I'll also test the key-value store: with no journal any more, there are
fewer writes. (Not sure it works fine with rbd for the moment.)

----- Original Message -----
From: Christian Balzer ch...@gol.com
To: ceph-users@lists.ceph.com
Sent: Friday, 23 May 2014 07:29:52
Subject: Re: [ceph-users] full osd ssd cluster advise : replication 2x or 3x ?

On Fri, 23 May 2014 07:02:15 +0200 (CEST) Alexandre DERUMIER wrote:

>> What is your main goal for that cluster, high IOPS, high sequential
>> writes or reads?
>
> High iops, mostly random. (It's an rbd cluster with qemu-kvm guests,
> around 1000 VMs, each doing small IOs.) 80% read / 20% write. I don't
> care about sequential workload or bandwidth.
>
>> Remember my "Slow IOPS on RBD..." thread, you probably shouldn't
>> expect more than 800 write IOPS and 4000 read IOPS per OSD
>> (replication 2).
>
> Yes, that's enough for me! I can't use spinning disks, because they are
> really too slow. I need around 30,000 iops for around 20TB of storage.
> I could even go to cheaper consumer SSDs (like the Crucial m550), I
> think I could reach 2000-4000 iops from them. But I'm afraid of
> durability/stability.

That's not the only thing you should worry about. Aside from the higher
risk there's total cost of ownership, or cost per terabyte written
($/TBW). So while the DC S3700 800GB is about $1800 and the same sized DC
S3500 is about $850, the 3700 can reliably store 7300TB while the 3500 is
only rated for 450TB. You do the math. ^.^

Christian

----- Original Message -----
From: Christian Balzer ch...@gol.com
To: ceph-users@lists.ceph.com
Sent: Friday, 23 May 2014 04:57:51
Subject: Re: [ceph-users] full osd ssd cluster advise : replication 2x or 3x ?

Hello,

On Thu, 22 May 2014 18:00:56 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi, I'm looking to build a full osd ssd cluster, with this config:

What is your main goal for that cluster, high IOPS, high sequential
writes or reads? Remember my "Slow IOPS on RBD..." thread, you probably
shouldn't expect more than 800 write IOPS and 4000 read IOPS per OSD
(replication 2).

> 6 nodes, each node 10 osd/ssd drives (dual 10gbit network).
> (1 journal + data on each osd)

Halving the write speed of the SSD, leaving you with about 2GB/s max
write speed per node. If you're after good write speeds, and with a
replication factor of 2, I would split the network into public and
cluster ones. If you're however after top read speeds, use bonding for
the 2 links into the public network; half of your SSDs per node are able
to saturate that.

> ssd drives will be enterprise grade, maybe the Intel DC S3500 800GB (a
> well known ssd)

How much write activity do you expect per OSD (remember that in your
case writes are doubled)? Those drives have a total write capacity of
about 450TB (within 5 years).

> or the new Samsung SSD PM853T 960GB (I don't have too much info about
> it for the moment, but the price seems a little bit lower than Intel)

Looking at the specs it seems to have better endurance (I used
500GB/day, a value that seemed realistic given the 2 numbers they gave),
at least double that of the Intel. Alas they only give a 3 year
warranty, which makes me wonder. Also the latencies are significantly
higher than the 3500.

> I would like to have some advice on the replication level. Maybe
> somebody has experience with Intel DC S3500 failure rates?

I doubt many people have managed to wear out SSDs of that vintage in
normal usage yet. And
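For reference, the endurance arithmetic in Alexandre's reply above can be
checked with a quick one-liner (60 drives at 450 TBW each, 6 TB/day of
replicated writes - the figures from this thread; substitute your own):

    # 60 SSDs x 450 TBW = 27000 TBW of fleet endurance
    # 6 TB/day = 2 TB/day of client writes x 3 replicas
    echo "60 * 450 / 6" | bc          # 4500 days
    echo "scale=1; 4500 / 365" | bc   # ~12.3 years before rated endurance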
Re: [ceph-users] collectd / graphite / grafana .. calamari?
https://github.com/rochaporto/collectd-ceph

> It has a set of collectd plugins pushing metrics which mostly map what
> the ceph commands return. In the setup we have it pushes them to
> graphite and the displays rely on grafana (check for a screenshot in
> the link above).

Thanks for sharing, Ricardo! I was looking to create a dashboard for
grafana too; yours seems very good :)

----- Original Message -----
From: Ricardo Rocha rocha.po...@gmail.com
To: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Sent: Friday, 23 May 2014 02:58:04
Subject: collectd / graphite / grafana .. calamari?

Hi.

I saw the thread a couple days ago on ceph-users regarding collectd...
and yes, I've been working on something similar for the last few days :)

https://github.com/rochaporto/collectd-ceph

It has a set of collectd plugins pushing metrics which mostly map what
the ceph commands return. In the setup we have it pushes them to
graphite and the displays rely on grafana (check for a screenshot in the
link above).

As it relies on common building blocks, it's easily extensible and we'll
come up with new dashboards soon - things like plotting osd data against
the metrics from the collectd disk plugin, which we also deploy.

This email is mostly to share the work, but also to check on Calamari. I
asked Patrick after the RedHat/Inktank news; I have no idea what it
provides, but I'm sure it comes with lots of extra sauce - he suggested
asking on the list. What's the timeline to have it open sourced? It
would be great to have a look at it, and as there's work from different
people in this area, maybe start working together on some fancier
monitoring tools.

Regards,
Ricardo
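As a side note for anyone wiring collectd into graphite for this setup:
the ceph plugins from the repo above are configured separately, but the
forwarding itself is just the stock write_graphite plugin. A minimal
sketch (the host name is a placeholder for your carbon endpoint):

    # /etc/collectd/collectd.conf.d/graphite.conf
    LoadPlugin write_graphite
    <Plugin write_graphite>
      <Node "graphite">
        Host "graphite.example.com"   # carbon-cache host (placeholder)
        Port "2003"                   # default carbon line-receiver port
        Protocol "tcp"
        Prefix "collectd."
        StoreRates true
      </Node>
    </Plugin>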
[ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
Dear ceph,

I am trying to set up ceph 0.80.1 with the following components:

1 x mon  - Debian Wheezy (i386)
3 x osds - Debian Wheezy (i386)

(all are kvm powered)

Status after the standard setup procedure:

root@ceph-node2:~# ceph -s
    cluster d079dd72-8454-4b4a-af92-ef4c424d96d8
     health HEALTH_WARN 192 pgs incomplete; 192 pgs stuck inactive; 192 pgs stuck unclean
     monmap e1: 1 mons at {ceph-node1=192.168.123.48:6789/0}, election epoch 2, quorum 0 ceph-node1
     osdmap e11: 3 osds: 3 up, 3 in
      pgmap v18: 192 pgs, 3 pools, 0 bytes data, 0 objects
            103 MB used, 15223 MB / 15326 MB avail
                 192 incomplete

root@ceph-node2:~# ceph health
HEALTH_WARN 192 pgs incomplete; 192 pgs stuck inactive; 192 pgs stuck unclean

root@ceph-node2:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      0       root default
-2      0               host ceph-node2
0       0                       osd.0   up      1
-3      0               host ceph-node3
1       0                       osd.1   up      1
-4      0               host ceph-node4
2       0                       osd.2   up      1

root@ceph-node2:~# ceph osd dump
epoch 11
fsid d079dd72-8454-4b4a-af92-ef4c424d96d8
created 2014-05-23 09:00:08.780211
modified 2014-05-23 09:01:33.438001
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool stripe_width 0
max_osd 3
osd.0 up in weight 1 up_from 4 up_thru 5 down_at 0 last_clean_interval [0,0) 192.168.123.49:6800/11373 192.168.123.49:6801/11373 192.168.123.49:6802/11373 192.168.123.49:6803/11373 exists,up 21a7d2a8-b709-4a28-bc3b-850913fe4c6b
osd.1 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) 192.168.123.50:6800/10542 192.168.123.50:6801/10542 192.168.123.50:6802/10542 192.168.123.50:6803/10542 exists,up c1cd3ad1-b086-438f-a22d-9034b383a1be
osd.2 up in weight 1 up_from 11 up_thru 0 down_at 0 last_clean_interval [0,0) 192.168.123.53:6800/6962 192.168.123.53:6801/6962 192.168.123.53:6802/6962 192.168.123.53:6803/6962 exists,up aa06d7e4-181c-4d70-bb8e-018b088c5053

What am I doing wrong here? Or what additional information should I
provide to help troubleshoot this?

thanks,

---
Jan

P.S. with emperor 0.72.2 I had no such problems
[ceph-users] Screencast/tutorial on setting up Ceph
Hi,

I have four old machines lying around. I would like to set up ceph on
these machines. Are there any screencasts or tutorials with commands on
how to obtain, install and configure ceph on these machines?

The official documentation page "OS Recommendations" seems to list only
old distros and not the new versions of distros (openSUSE and Ubuntu).
So I wanted to ask if there is a screencast or tutorial or techtalk on
how to set up Ceph for a total newbie?

--
Sankar P
http://psankar.blogspot.com
Re: [ceph-users] ceph deploy on rhel6.5 installs ceph from el6 and fails
Hi Simon,

thanks for your reply. I already installed the OS for my ceph nodes via
Kickstart (via network) from Redhat Satellite, and I don't want to do
that again because some other config has also been done. xfsprogs is not
part of the rhel base repository but of some extra package with costs
per node/CPU/whatever called "Scalable File System". For some other
nodes I installed xfsprogs from the centos-6-base repo, but now I want
to try a clean rhel-based-only install, so I'll add ceph on my nodes
from /etc/yum.repos.d/ceph, install manually with yum, then do a
ceph-deploy and see what will happen ;)

Greetz from Munich

Erik

From: Simon Ironside [sirons...@caffetine.org]
Sent: Friday, 23 May 2014 01:07
To: Lukac, Erik
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph deploy on rhel6.5 installs ceph from el6 and fails

On 22/05/14 23:56, Lukac, Erik wrote:
> But: this fails because of the dependencies. xfsprogs is in the rhel6
> repo, but not in el6 :(

I hadn't noticed that xfsprogs is included in the ceph repos. I'm using
the package from the RHEL 6.5 DVD, which is the same version; you'll
find it in the ScalableFileSystem repo on the install DVD.

HTH,
Simon.

--
Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München
Telefon: +49 89 590001; E-Mail: i...@br.de; Website: http://www.BR.de
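For reference, the repo file being added is along these lines (a sketch
following the layout documented on ceph.com; baseurl shown for
firefly/el6 - double-check the URL for your release):

    # /etc/yum.repos.d/ceph.repo
    [ceph]
    name=Ceph packages for x86_64
    baseurl=http://ceph.com/rpm-firefly/el6/x86_64/
    enabled=1
    gpgcheck=1
    gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc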
Re: [ceph-users] Question about osd objectstore = keyvaluestore-dev setting
Sent from my iPhone

On 22 May 2014, at 22:26, Gregory Farnum g...@inktank.com wrote:

> On Thu, May 22, 2014 at 5:04 AM, Geert Lindemulder glindemul...@snow.nl wrote:
>> Hello All
>>
>> Trying to implement the osd leveldb backend on an existing ceph test
>> cluster. The test cluster was updated from 0.72.1 to 0.80.1. The
>> update was ok.
>> After the update, the "osd objectstore = keyvaluestore-dev" setting
>> was added to ceph.conf.
>
> Does that mean you tried to switch to the KeyValueStore on one of your
> existing OSDs? That isn't going to work; you'll need to create new
> ones (or knock out old ones and recreate them with it).
>
>> After restarting an osd it gives the following error:
>>
>> 2014-05-22 12:28:06.805290 7f2e7d9de800 -1 KeyValueStore::mount :
>> stale version stamp 3. Please run the KeyValueStore update script
>> before starting the OSD, or set keyvaluestore_update_to to 1
>>
>> How can the keyvaluestore_update_to parameter be set, or where can I
>> find the KeyValueStore update script?
>
> Hmm, it looks like that config value isn't actually plugged in to the
> KeyValueStore, so you can't set it with the stock binaries. Maybe
> Haomai has an idea?

Yes, the error is that the keyvaluestore reads the version stamp from
the existing osd data. The version is incorrect, and maybe there should
be a clearer error message.

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Screencast/tutorial on setting up Ceph
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sankar P
> Sent: Friday, 23 May 2014 11:14
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Screencast/tutorial on setting up Ceph
>
> Hi,
>
> I have four old machines lying around. I would like to set up ceph on
> these machines. Are there any screencasts or tutorials with commands on
> how to obtain, install and configure ceph on these machines?
>
> The official documentation page "OS Recommendations" seems to list only
> old distros and not the new versions of distros (openSUSE and Ubuntu).
> So I wanted to ask if there is a screencast or tutorial or techtalk on
> how to set up Ceph for a total newbie?
>
> --
> Sankar P
> http://psankar.blogspot.com

Hi,

I am a rookie too and only used just this:
http://ceph.com/docs/master/start/ - it's a very nice doc.

---
jan
Re: [ceph-users] Radosgw Timeout
On 22.05.2014 15:36, Yehuda Sadeh wrote:
> On Thu, May 22, 2014 at 6:16 AM, Georg Höllrigl georg.hoellr...@xidras.com wrote:
>> Hello List,
>>
>> Using the radosgw works fine, as long as the amount of data doesn't
>> get too big. I have created one bucket that holds many small files,
>> separated into different directories. But whenever I try to access the
>> bucket, I only run into some timeout. The timeout is at around 30 -
>> 100 seconds. This is smaller than the Apache timeout of 300 seconds.
>>
>> I've tried to access the bucket with different clients - one thing is
>> s3cmd - which is still able to upload things, but takes a rather long
>> time when listing the contents. Then I've tried with s3fs-fuse - which
>> throws "ls: reading directory .: Input/output error". Also Cyberduck
>> and S3Browser show similar behavior.
>>
>> Is there an option to only send back maybe 1000 list entries, like
>> Amazon does? So that the client might decide if it wants to list all
>> the contents?
>
> That's how it works, it doesn't return more than 1000 entries at once.

OK. I found that in the requests. So it's the client that states how
many objects should be in the listing, by sending the max-keys=1000
variable:

- - - [23/May/2014:08:49:33 +0000] "GET /test/?delimiter=%2F&max-keys=1000&prefix HTTP/1.1" 200 715 "-" "Cyberduck/4.4.4 (14505) (Windows NT (unknown)/6.2) (x86)" xidrasservice.com:443

>> Are there any timeout values in radosgw?
>
> Are you sure the timeout is in the gateway itself? Could be apache that
> is timing out. Will need to see the apache access logs for these
> operations, radosgw debug and messenger logs (debug rgw = 20, debug ms
> = 1), to give a better answer.

No, I'm not sure where the timeout comes from. As far as I can tell,
apache times out after 300 seconds - so that should not be the problem.

I think I found something in the apache logs:

[Fri May 23 08:59:39.385548 2014] [fastcgi:error] [pid 3035:tid 140723006891776] [client 10.0.1.66:46049] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
[Fri May 23 08:59:39.385604 2014] [fastcgi:error] [pid 3035:tid 140723006891776] [client 10.0.1.66:46049] FastCGI: incomplete headers (0 bytes) received from server "/var/www/s3gw.fcgi"

I've increased the timeout to 900 in the apache vhost config:

FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/radosgw.vvx-ceph-m-02 -idle-timeout 900

Now it's not working, and I don't get a log entry any more. Most
interesting when watching the debug output: rados reports that it
successfully finished the request, but at the same time the client tells
me it failed. I've shortened the log file; as far as I could see, the
info repeats itself...
2014-05-23 09:38:43.051395 7f1b427fc700  1 ====== starting new request req=0x7f1b3400f1c0 ======
2014-05-23 09:38:43.051597 7f1b427fc700  1 -- 10.0.1.107:0/1005898 --> 10.0.1.199:6800/14453 -- osd_op(client.72942.0:120 UHXW458EH1RVULE1BCEH [getxattrs,stat] 11.10193f7e ack+read e279) v4 -- ?+0 0x7f1b4640 con 0x2455930
2014-05-23 09:38:43.053180 7f1b96d80700  1 -- 10.0.1.107:0/1005898 <== osd.0 10.0.1.199:6800/14453 23 ==== osd_op_reply(120 UHXW458EH1RVULE1BCEH [getxattrs,stat] v0'0 uv1 ondisk = 0) v6 ==== 229+0+20 (1060030390 0 1010060712) 0x7f1b58002540 con 0x2455930
2014-05-23 09:38:43.053380 7f1b427fc700  1 -- 10.0.1.107:0/1005898 --> 10.0.1.199:6800/14453 -- osd_op(client.72942.0:121 UHXW458EH1RVULE1BCEH [read 0~524288] 11.10193f7e ack+read e279) v4 -- ?+0 0x7f1b45d0 con 0x2455930
2014-05-23 09:38:43.054359 7f1b96d80700  1 -- 10.0.1.107:0/1005898 <== osd.0 10.0.1.199:6800/14453 24 ==== osd_op_reply(121 UHXW458EH1RVULE1BCEH [read 0~8] v0'0 uv1 ondisk = 0) v6 ==== 187+0+8 (3510944971 0 3829959217) 0x7f1b580057b0 con 0x2455930
2014-05-23 09:38:43.054490 7f1b427fc700  1 -- 10.0.1.107:0/1005898 --> 10.0.1.199:6806/15018 -- osd_op(client.72942.0:122 macm [getxattrs,stat] 7.1069f101 ack+read e279) v4 -- ?+0 0x7f1b6010 con 0x2457de0
2014-05-23 09:38:43.055871 7f1b96d80700  1 -- 10.0.1.107:0/1005898 <== osd.2 10.0.1.199:6806/15018 3 ==== osd_op_reply(122 macm [getxattrs,stat] v0'0 uv46 ondisk = 0) v6 ==== 213+0+91 (22324782 0 2022698800) 0x7f1b500025a0 con 0x2457de0
2014-05-23 09:38:43.055963 7f1b427fc700  1 -- 10.0.1.107:0/1005898 --> 10.0.1.199:6806/15018 -- osd_op(client.72942.0:123 macm [read 0~524288] 7.1069f101 ack+read e279) v4 -- ?+0 0x7f1b3950 con 0x2457de0
2014-05-23 09:38:43.057087 7f1b96d80700  1 -- 10.0.1.107:0/1005898 <== osd.2 10.0.1.199:6806/15018 4 ==== osd_op_reply(123 macm [read 0~310] v0'0 uv46 ondisk = 0) v6 ==== 171+0+310 (3762965810 0 1648184722) 0x7f1b500026e0 con 0x2457de0
2014-05-23 09:38:43.057364 7f1b427fc700  1 -- 10.0.1.107:0/1005898 --> 10.0.0.26:6809/4834 -- osd_op(client.72942.0:124 store [call version.read,getxattrs,stat] 5.c5755cee ack+read e279) v4 -- ?+0 0x7f1b66b0 con 0x7f1b440022e0
2014-05-23 09:38:43.059223 7f1b96d80700  1 -- 10.0.1.107:0/1005898 <== osd.7 10.0.0.26:6809/4834 37
Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
Try increasing the placement groups for the pools:

ceph osd pool set data pg_num 128
ceph osd pool set data pgp_num 128

similarly for the other 2 pools as well.

- karan -

On 23 May 2014, at 11:50, jan.zel...@id.unibe.ch wrote:

> Dear ceph,
>
> I am trying to set up ceph 0.80.1 with the following components:
>
> 1 x mon  - Debian Wheezy (i386)
> 3 x osds - Debian Wheezy (i386)
> (all are kvm powered)
>
> Status after the standard setup procedure:
>
> root@ceph-node2:~# ceph -s
>     cluster d079dd72-8454-4b4a-af92-ef4c424d96d8
>      health HEALTH_WARN 192 pgs incomplete; 192 pgs stuck inactive; 192 pgs stuck unclean
> [...]
>
> What am I doing wrong here? Or what additional information should I
> provide to help troubleshoot this?
>
> thanks,
> Jan
>
> P.S. with emperor 0.72.2 I had no such problems
Re: [ceph-users] Screencast/tutorial on setting up Ceph
use my blogs if you like:

http://karan-mj.blogspot.fi/2013/12/ceph-storage-part-2.html

- Karan Singh -

On 23 May 2014, at 12:30, jan.zel...@id.unibe.ch wrote:

>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sankar P
>> Sent: Friday, 23 May 2014 11:14
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] Screencast/tutorial on setting up Ceph
>>
>> [...]
>
> Hi,
>
> I am a rookie too and only used just this:
> http://ceph.com/docs/master/start/ - it's a very nice doc.
>
> ---
> jan
Re: [ceph-users] Radosgw Timeout
Thank you very much - I think I've solved the whole thing. It wasn't in
radosgw. The solution was:

- increase the timeout in the Apache conf
- when using haproxy, also increase the timeouts there!

Georg

On 22.05.2014 15:36, Yehuda Sadeh wrote:
> On Thu, May 22, 2014 at 6:16 AM, Georg Höllrigl georg.hoellr...@xidras.com wrote:
>> Hello List,
>>
>> Using the radosgw works fine, as long as the amount of data doesn't
>> get too big.
>> [...]
>>
>> Is there an option to only send back maybe 1000 list entries, like
>> Amazon does? So that the client might decide if it wants to list all
>> the contents?
>
> That's how it works, it doesn't return more than 1000 entries at once.
>
>> Are there any timeout values in radosgw?
>
> Are you sure the timeout is in the gateway itself? Could be apache that
> is timing out. Will need to see the apache access logs for these
> operations, radosgw debug and messenger logs (debug rgw = 20, debug ms
> = 1), to give a better answer.
>
> Yehuda
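For the archives, the two timeout knobs in question look roughly like
this (the socket path and the 900 second value are the ones from this
thread; the haproxy directives are the standard timeout settings, values
are examples):

    # Apache vhost (mod_fastcgi)
    FastCgiExternalServer /var/www/s3gw.fcgi \
        -socket /var/run/ceph/radosgw.sock -idle-timeout 900

    # haproxy.cfg
    defaults
        timeout connect 10s
        timeout client  900s
        timeout server  900s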
Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
64 PGs per pool /shouldn't/ cause any issues while there are only 3
OSDs. It'll be something to pay attention to if a lot more get added
though.

Your replication setup is probably set to something other than host.
You'll want to extract your crush map, then decompile it and see if your
"step chooseleaf" type is set to osd or rack. If it's not host, change
it to host and pull the map in again. Check the docs on crush maps
(http://ceph.com/docs/master/rados/operations/crush-map/) for more info.

-Michael

On 23/05/2014 10:53, Karan Singh wrote:
> Try increasing the placement groups for the pools:
>
> ceph osd pool set data pg_num 128
> ceph osd pool set data pgp_num 128
>
> similarly for the other 2 pools as well.
>
> - karan -
>
> On 23 May 2014, at 11:50, jan.zel...@id.unibe.ch wrote:
>> Dear ceph,
>>
>> I am trying to set up ceph 0.80.1 with the following components:
>> 1 x mon  - Debian Wheezy (i386)
>> 3 x osds - Debian Wheezy (i386)
>> (all are kvm powered)
>> [...]
>> P.S. with emperor 0.72.2 I had no such problems
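The extract/decompile/edit/reload cycle Michael describes is, concretely:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt, e.g. change
    #   step chooseleaf firstn 0 type osd
    # to
    #   step chooseleaf firstn 0 type host
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new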
Re: [ceph-users] collectd / graphite / grafana .. calamari?
Hi Ricardo,

Let me share a few notes on metrics in calamari:

* We're bundling graphite, and using diamond to send home metrics. The
diamond collector used in calamari has always been open source [1].
* The Calamari UI has its own graphs page that talks directly to the
graphite API (the calamari REST API does not duplicate any of the
graphing interface).
* We also bundle the default graphite dashboard, so that folks can go to
/graphite/dashboard/ on the calamari server to plot anything custom they
want to.

It could be quite interesting to hook in Grafana there in the same way
that we currently hook in the default graphite dashboard, as grafana is
definitely nicer and would give us a roadmap to influxdb (a project I am
quite excited about).

Cheers,
John

1. https://github.com/ceph/Diamond/commits/calamari

On Fri, May 23, 2014 at 1:58 AM, Ricardo Rocha rocha.po...@gmail.com wrote:
> Hi.
>
> I saw the thread a couple days ago on ceph-users regarding collectd...
> and yes, I've been working on something similar for the last few days :)
>
> https://github.com/rochaporto/collectd-ceph
>
> It has a set of collectd plugins pushing metrics which mostly map what
> the ceph commands return. In the setup we have it pushes them to
> graphite and the displays rely on grafana (check for a screenshot in
> the link above).
> [...]
>
> Regards,
> Ricardo
Re: [ceph-users] Radosgw Timeout
On 22.05.2014 17:30, Craig Lewis wrote:
> On 5/22/14 06:16, Georg Höllrigl wrote:
>> I have created one bucket that holds many small files, separated into
>> different directories. But whenever I try to access the bucket, I only
>> run into some timeout. The timeout is at around 30 - 100 seconds. This
>> is smaller than the Apache timeout of 300 seconds.
>
> Just so we're all talking about the same things, what does "many small
> files" mean to you? Also, how are you separating them into directories?
> Are you just giving files in the same directory the same leading
> string, like dir1_subdir1_filename?

I can only estimate how many files. ATM I have 25M files on the origin,
but only 1/10th has been synced to radosgw. These are distributed
through 20 folders, each containing about 2k directories with ~100 - 500
files each. Do you think that's too much for this use case?

> I'm putting about 1M objects, random sizes, in each bucket. I'm not
> having problems getting individual files, or uploading new ones. It
> does take a long time for s3cmd to list the contents of the bucket. The
> only time I get timeouts is when my cluster is very unhealthy.
>
> If you're doing a lot more than that, say 10M or 100M objects, then
> that could cause a hot spot on disk. You might be better off taking
> your directories, and putting them in their own bucket.
>
> --
> Craig Lewis
> Senior Systems Engineer
> Central Desktop
Re: [ceph-users] Occasional Missing Admin Sockets
Hi Mike,

Sorry I missed this message. Are you able to reproduce the problem? Does
it always happen when you logrotate --force, or only sometimes?

Cheers

On 13/05/2014 21:23, Gregory Farnum wrote:
> Yeah, I just did so. :(
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> On Tue, May 13, 2014 at 11:41 AM, Mike Dawson mike.daw...@cloudapt.com wrote:
>> Greg/Loic,
>>
>> I can confirm that logrotate --force /etc/logrotate.d/ceph removes the
>> monitor admin socket on my boxes running 0.80.1, just like the
>> description in issue 7188 [0].
>>
>> 0: http://tracker.ceph.com/issues/7188
>>
>> Should that bug be reopened?
>>
>> Thanks,
>> Mike Dawson
>>
>> On 5/13/2014 2:10 PM, Gregory Farnum wrote:
>>> On Tue, May 13, 2014 at 9:06 AM, Mike Dawson mike.daw...@cloudapt.com wrote:
>>>> All, I have a recurring issue where the admin sockets
>>>> (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while
>>>> the daemons keep running
>>>
>>> Hmm.
>>>
>>>> (or restart without my knowledge).
>>>
>>> I'm guessing this might be involved:
>>>
>>>> I see this issue on a dev cluster running Ubuntu and Ceph
>>>> Emperor/Firefly, deployed with ceph-deploy using Upstart to control
>>>> daemons. I never see this issue on Ubuntu / Dumpling / sysvinit.
>>>
>>> *goes and greps the git log*
>>>
>>> I'm betting it was commit 45600789f1ca399dddc5870254e5db883fb29b38
>>> (which has, in fact, been backported to dumpling and emperor),
>>> intended so that turning on a new daemon wouldn't remove the admin
>>> socket of an existing one. But I think that means that if you
>>> activate the new daemon before the old one has finished shutting down
>>> and unlinking, you would end up with a daemon that had no admin
>>> socket. Perhaps it's an incomplete fix and we need a tracker ticket?
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com

--
Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] Occasional Missing Admin Sockets
On 13/05/2014 20:10, Gregory Farnum wrote:
> On Tue, May 13, 2014 at 9:06 AM, Mike Dawson mike.daw...@cloudapt.com wrote:
>> All, I have a recurring issue where the admin sockets
>> (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while
>> the daemons keep running
>
> Hmm.
>
>> (or restart without my knowledge).
>
> I'm guessing this might be involved:
>
>> I see this issue on a dev cluster running Ubuntu and Ceph
>> Emperor/Firefly, deployed with ceph-deploy using Upstart to control
>> daemons. I never see this issue on Ubuntu / Dumpling / sysvinit.
>
> *goes and greps the git log*
>
> I'm betting it was commit 45600789f1ca399dddc5870254e5db883fb29b38
> (which has, in fact, been backported to dumpling and emperor), intended
> so that turning on a new daemon wouldn't remove the admin socket of an
> existing one. But I think that means that if you activate the new
> daemon before the old one has finished shutting down and unlinking, you
> would end up with a daemon that had no admin socket. Perhaps it's an
> incomplete fix and we need a tracker ticket?

https://github.com/ceph/ceph/commit/45600789f1ca399dddc5870254e5db883fb29b38

I see the race condition now, missed it the first time around, thanks
Greg :-) I'll work on it.

Cheers

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

--
Loïc Dachary, Artisan Logiciel Libre
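For anyone checking whether they're affected, the sockets are easy to
probe (paths assume the default cluster name "ceph"):

    ls -l /var/run/ceph/*.asok
    # if the socket is there, the daemon should answer:
    ceph daemon osd.0 version
    # equivalent long form, pointing at the socket explicitly:
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version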
Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
Hi,

if you use debian, try to use a recent kernel from backports (3.10).

Also check your libleveldb1 version; it should be 1.9.0-1~bpo70+1 (the
debian wheezy version is too old). I don't see it in the ceph repo:
http://ceph.com/debian-firefly/pool/main/l/leveldb/ (only for squeeze,
~bpo60+1), but you can take it from our proxmox repository:

http://download.proxmox.com/debian/dists/wheezy/pve-no-subscription/binary-amd64/libleveldb1_1.9.0-1~bpo70+1_amd64.deb

----- Original Message -----
From: jan zeller jan.zel...@id.unibe.ch
To: ceph-users@lists.ceph.com
Sent: Friday, 23 May 2014 10:50:40
Subject: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean

> Dear ceph,
>
> I am trying to set up ceph 0.80.1 with the following components:
>
> 1 x mon  - Debian Wheezy (i386)
> 3 x osds - Debian Wheezy (i386)
> (all are kvm powered)
> [...]
>
> What am I doing wrong here? Or what additional information should I
> provide to help troubleshoot this?
>
> thanks,
> Jan
>
> P.S. with emperor 0.72.2 I had no such problems
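For completeness, installing the backported package from the URL above
is just:

    wget http://download.proxmox.com/debian/dists/wheezy/pve-no-subscription/binary-amd64/libleveldb1_1.9.0-1~bpo70+1_amd64.deb
    dpkg -i libleveldb1_1.9.0-1~bpo70+1_amd64.deb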
Re: [ceph-users] Question about osd objectstore = keyvaluestore-dev setting
Hello Greg and Haomai,

Thanks for the answers. I was trying to implement the osd leveldb
backend on an existing ceph test cluster. At the moment I am removing
the osd's one by one and recreating them with the "osd objectstore =
keyvaluestore-dev" option in place in ceph.conf. This works fine, and
the backend is leveldb now for the new osd's. The leveldb backend looks
more efficient.

The error gave me the idea that migrating an osd from a non-leveldb
backend to the new leveldb type was possible. Will online migration of
existing osd's be added in the future?

Thanks,
Geert

On 05/23/2014 11:31 AM, GMail wrote:
> implement the osd leveldb backend at an existing ceph test cluster.
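In case it's useful to others, the per-OSD remove-and-recreate cycle is
roughly the standard one (host and device names below are placeholders;
use the matching sysvinit/Upstart command for your distro):

    ceph osd out 0
    service ceph stop osd.0        # or: stop ceph-osd id=0 (Upstart)
    ceph osd crush remove osd.0
    ceph auth del osd.0
    ceph osd rm 0
    # with "osd objectstore = keyvaluestore-dev" now in ceph.conf:
    ceph-deploy osd create ceph-node1:/dev/sdb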
Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
> -----Original Message-----
> From: Alexandre DERUMIER [mailto:aderum...@odiso.com]
> Sent: Friday, 23 May 2014 13:20
> To: Zeller, Jan (ID)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
>
> Hi,
>
> if you use debian, try to use a recent kernel from backports (3.10).
>
> Also check your libleveldb1 version; it should be 1.9.0-1~bpo70+1 (the
> debian wheezy version is too old). I don't see it in the ceph repo:
> http://ceph.com/debian-firefly/pool/main/l/leveldb/ (only for squeeze,
> ~bpo60+1), but you can take it from our proxmox repository:
> http://download.proxmox.com/debian/dists/wheezy/pve-no-subscription/binary-amd64/libleveldb1_1.9.0-1~bpo70+1_amd64.deb

thanks Alexandre, due to this I'll try the whole setup on Ubuntu 12.04.
Maybe it's going to be a bit easier...

---
jan
Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
> thanks Alexandre, due to this I'll try the whole setup on Ubuntu 12.04.
> Maybe it's going to be a bit easier...

Yes, I think you can use the latest Ubuntu LTS; I think ceph 0.79 is
officially supported, so it should not be a problem for firefly.

----- Original Message -----
From: jan zeller jan.zel...@id.unibe.ch
To: aderum...@odiso.com
Cc: ceph-users@lists.ceph.com
Sent: Friday, 23 May 2014 13:36:04
Subject: AW: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean

[...]
[ceph-users] How to create authentication signature for getting user details
Hi All,

I would like to create a function for getting the user details by
passing a user id (uid) using php and curl. I am planning to pass the
user id as 'admin' (admin is a user which is already there) and get the
details of that user. Could you please tell me how we can create the
authentication signature for this?

I tried the approach described in
http://mashupguide.net/1.0/html/ch16s05.xhtml#ftn.d0e27318 but it's not
working; I get a "Failed to authenticate" error (this is because the
signature is not generated properly).

If anyone knows a proper way to generate the authentication signature
using php, please help me to solve this.

--
Regards
Shanil
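In case it helps: the radosgw admin API expects the same S3-style
HMAC-SHA1 signature as normal gateway requests. A minimal sketch from
the shell - the access key, secret and gateway host below are
placeholders, and the same string-to-sign can be built in PHP with
hash_hmac('sha1', $stringToSign, $secret, true) plus base64_encode():

    ACCESS_KEY=MYACCESSKEY        # placeholder
    SECRET_KEY=MYSECRETKEY        # placeholder
    DATE=$(date -u '+%a, %d %b %Y %H:%M:%S GMT')
    # string to sign: METHOD\nContent-MD5\nContent-Type\nDate\nresource
    SIG=$(printf "GET\n\n\n%s\n/admin/user" "$DATE" \
          | openssl sha1 -hmac "$SECRET_KEY" -binary | base64)
    curl -s -H "Date: $DATE" \
         -H "Authorization: AWS $ACCESS_KEY:$SIG" \
         "http://gateway.example.com/admin/user?uid=admin&format=json"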
Re: [ceph-users] Question about osd objectstore = keyvaluestore-dev setting
Best Wishes!

On 23 May 2014, at 19:27, Geert Lindemulder glindemul...@snow.nl wrote:

> Hello Greg and Haomai,
>
> Thanks for the answers. I was trying to implement the osd leveldb
> backend on an existing ceph test cluster. At the moment I am removing
> the osd's one by one and recreating them with the objectstore =
> keyvaluestore-dev option in place in ceph.conf. This works fine, and
> the backend is leveldb now for the new osd's. The leveldb backend looks
> more efficient.

Happy to see it, although I'm still trying to improve performance for
some workloads.

> The error gave me the idea that migrating an osd from a non-leveldb
> backend to the new leveldb type was possible. Will online migration of
> existing osd's be added in the future?

Not yet; I think it's a good feature. We could implement it at the
ObjectStore class level and simply convert one type to another.

> Thanks,
> Geert
[ceph-users] How to get Object ID ?
I want to know/read the Object ID assigned by ceph to a file which I
transferred via crossftp. How can I read the 64-bit Object ID?
[ceph-users] osd pool default pg num problem
In Firefly, I added the lines below to the [global] section in
ceph.conf; however, after creating the cluster, the default pools
metadata/data/rbd still have a pg num over 900, not 375. Any suggestion?

osd pool default pg num = 375
osd pool default pgp num = 375
[ceph-users] Pool snaps
Hi!

I can't find any information about ceph osd pool snapshots, except for
the commands mksnap and rmsnap. What features do snapshots enable? Can I
do things such as diff-export/import just like rbd can?

Thanks!
Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean
Thanks for your tips & tricks.

This setup is now based on ubuntu 12.04, ceph version 0.80.1. Still
using 1 x mon, 3 x osds.

root@ceph-node2:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      0       root default
-2      0               host ceph-node2
0       0                       osd.0   up      1
-3      0               host ceph-node3
1       0                       osd.1   up      1
-4      0               host ceph-node1
2       0                       osd.2   up      1

root@ceph-node2:~# ceph -s
    cluster c30e1410-fe1a-4924-9112-c7a5d789d273
     health HEALTH_WARN 192 pgs incomplete; 192 pgs stuck inactive; 192 pgs stuck unclean
     monmap e1: 1 mons at {ceph-node1=192.168.123.48:6789/0}, election epoch 2, quorum 0 ceph-node1
     osdmap e11: 3 osds: 3 up, 3 in
      pgmap v18: 192 pgs, 3 pools, 0 bytes data, 0 objects
            102 MB used, 15224 MB / 15326 MB avail
                 192 incomplete

root@ceph-node2:~# cat mycrushmap.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-node2 {
        id -2           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.000
}
host ceph-node3 {
        id -3           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 0.000
}
host ceph-node1 {
        id -4           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 0.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item ceph-node2 weight 0.000
        item ceph-node3 weight 0.000
        item ceph-node1 weight 0.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
# end crush map

Is there anything wrong with it?

root@ceph-node2:~# ceph osd dump
epoch 11
fsid c30e1410-fe1a-4924-9112-c7a5d789d273
created 2014-05-23 15:16:57.772981
modified 2014-05-23 15:18:17.022152
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool stripe_width 0
max_osd 3
osd.0 up in weight 1 up_from 4 up_thru 5 down_at 0 last_clean_interval [0,0) 192.168.123.49:6800/4714 192.168.123.49:6801/4714 192.168.123.49:6802/4714 192.168.123.49:6803/4714 exists,up bc991a4b-9e60-4759-b35a-7f58852aa804
osd.1 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) 192.168.123.50:6800/4685 192.168.123.50:6801/4685 192.168.123.50:6802/4685 192.168.123.50:6803/4685 exists,up bd099d83-2483-42b9-9dbc-7f4e4043ca60
osd.2 up in weight 1 up_from 11 up_thru 0 down_at 0 last_clean_interval [0,0) 192.168.123.53:6800/16807 192.168.123.53:6801/16807 192.168.123.53:6802/16807 192.168.123.53:6803/16807 exists,up 80a302d0-3493-4c39-b34b-5af233b32ba1

thanks

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michael
Sent: Friday, 23 May 2014 12:36
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] pgs incomplete; pgs stuck inactive; pgs stuck unclean

> 64 PGs per pool /shouldn't/ cause any issues while there are only 3
> OSDs. It'll be something to pay attention to if a lot more get added
> though. Your replication setup is probably set to something other than
> host. You'll want to extract your crush map, then decompile it and see
> if your "step chooseleaf" type is set to osd or rack. If it's not
> host, change it to host and pull the map in again. Check the docs on
> crush maps (http://ceph.com/docs/master/rados/operations/crush-map/)
> for more info.
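One observation on the decompiled map above: every host and osd item has
weight 0.000. Tiny test disks like these ~5GB ones end up with a crush
weight that rounds to zero, so no PG can be placed anywhere, which
matches the "192 incomplete" state. A possible fix is to give the OSDs a
small non-zero weight:

    ceph osd crush reweight osd.0 0.05
    ceph osd crush reweight osd.1 0.05
    ceph osd crush reweight osd.2 0.05
    ceph -s    # PGs should start peering once the weights are non-zero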
Re: [ceph-users] Unable to update Swift ACL's on existing containers
Hi Yehuda,

On 23/05/14 02:25, Yehuda Sadeh wrote:
> That looks like a bug; generally the permission checks there are
> broken. I opened issue #8428, and pushed a fix on top of the firefly
> branch to wip-8428.

I cherry-picked the fix and tested - LGTM. Thanks for the quick fix.

Cheers

James

--
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org
Re: [ceph-users] osd pool default pg num problem
Those settings are applied when creating new pools with "osd pool
create", but not to the pools that are created automatically during
cluster setup. We've had the same question before
(http://comments.gmane.org/gmane.comp.file-systems.ceph.user/8150), so
maybe it's worth opening a ticket to do something about it.

Cheers,
John

On Fri, May 23, 2014 at 2:01 PM, Cao, Buddy buddy@intel.com wrote:
> In Firefly, I added the lines below to the [global] section in
> ceph.conf; however, after creating the cluster, the default pools
> metadata/data/rbd still have a pg num over 900, not 375. Any
> suggestion?
>
> osd pool default pg num = 375
> osd pool default pgp num = 375
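In the meantime, pools created explicitly do take the pg count directly
on the command line, e.g.:

    # pg_num and pgp_num given explicitly
    ceph osd pool create mypool 375 375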
Re: [ceph-users] How to find the disk partitions attached to a OSD
Thanks :-) That helped.

Thanks & Regards,
Sharmila

On Thu, May 22, 2014 at 6:41 PM, Alfredo Deza alfredo.d...@inktank.com wrote:
> Hopefully I am not late to the party :)
>
> But ceph-deploy recently gained an `osd list` subcommand that does this
> plus a bunch of other interesting metadata:
>
> $ ceph-deploy osd list node1
> [ceph_deploy.conf][DEBUG ] found configuration file at: /Users/alfredo/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.2): /Users/alfredo/.virtualenvs/ceph-deploy/bin/ceph-deploy osd list node1
> [node1][DEBUG ] connected to host: node1
> [node1][DEBUG ] detect platform information from remote host
> [node1][DEBUG ] detect machine type
> [node1][INFO  ] Running command: sudo ceph --cluster=ceph osd tree --format=json
> [node1][DEBUG ] connected to host: node1
> [node1][DEBUG ] detect platform information from remote host
> [node1][DEBUG ] detect machine type
> [node1][INFO  ] Running command: sudo ceph-disk list
> [node1][INFO  ]
> [node1][INFO  ] ceph-0
> [node1][INFO  ]
> [node1][INFO  ] Path           /var/lib/ceph/osd/ceph-0
> [node1][INFO  ] ID             0
> [node1][INFO  ] Name           osd.0
> [node1][INFO  ] Status         up
> [node1][INFO  ] Reweight       1.00
> [node1][INFO  ] Magic          ceph osd volume v026
> [node1][INFO  ] Journal_uuid   214a6865-416b-4c09-b031-a354d4f8bdff
> [node1][INFO  ] Active         ok
> [node1][INFO  ] Device         /dev/sdb1
> [node1][INFO  ] Whoami         0
> [node1][INFO  ] Journal path   /dev/sdb2
>
> On Thu, May 22, 2014 at 8:30 AM, John Spray john.sp...@inktank.com wrote:
>> On Thu, May 22, 2014 at 10:57 AM, Sharmila Govind sharmilagov...@gmail.com wrote:
>>> root@cephnode4:/mnt/ceph/osd2# mount | grep ceph
>>> /dev/sdc on /mnt/ceph/osd3 type ext4 (rw)
>>> /dev/sdb on /mnt/ceph/osd2 type ext4 (rw)
>>>
>>> All the above commands just pointed out the mount points
>>> (/mnt/ceph/osd3); the folders were named by me as ceph/osd. But if a
>>> new user has to get the osd mapping to the mounted devices, it would
>>> be difficult if we named the osd disk folders differently. Any other
>>> command which could give the mapping would be useful.
>>
>> It really depends on how you have set up the OSDs. If you're using
>> ceph-deploy or ceph-disk to partition and format the drives, they get
>> a special partition type set which marks them as a Ceph OSD. On a
>> system set up that way, you get nice uniform output like this:
>>
>> # ceph-disk list
>> /dev/sda :
>>  /dev/sda1 other, ext4, mounted on /boot
>>  /dev/sda2 other, LVM2_member
>> /dev/sdb :
>>  /dev/sdb1 ceph data, active, cluster ceph, osd.0, journal /dev/sdb2
>>  /dev/sdb2 ceph journal, for /dev/sdb1
>> /dev/sdc :
>>  /dev/sdc1 ceph data, active, cluster ceph, osd.3, journal /dev/sdc2
>>  /dev/sdc2 ceph journal, for /dev/sdc1
>>
>> John
[ceph-users] How to backup mon-data?
Hello,

I'm running a 3 node cluster with 2 hdd/osd and one mon on each node.
Sadly the fsyncs done by the mon processes eat my hdds. I was able to
eliminate this impact by moving the mon data dir to ramfs. This should
work as long as at least 2 nodes are running, but I want to implement
some kind of disaster recovery.

What's the correct way to back up mon data - if there is any?

Thanks,

Fabian
Re: [ceph-users] How to backup mon-data?
Hi,

I think you're rather brave (sorry, foolish) to store the mon data dir
in ramfs. One power outage and your cluster is dead. Even with good
backups of the data dir I wouldn't want to go through that exercise.

Saying that, we had a similar disk-io-bound problem with the mon data
dirs, and solved it by moving the mons to SSDs. Maybe in your case using
the cfq io scheduler would help, since at least then the OSD and MON
processes would get fair shares of the disk IOs.

Anyway, to back up the data dirs, you need to stop the mon daemon to get
a consistent leveldb before copying the data to a safe place.

Cheers, Dan

--
Dan van der Ster || Data Storage Services || CERN IT Department

On 23 May 2014, at 15:45, Fabian Zimmermann f.zimmerm...@xplosion.de wrote:
> Hello,
>
> I'm running a 3 node cluster with 2 hdd/osd and one mon on each node.
> Sadly the fsyncs done by the mon processes eat my hdds.
> [...]
>
> What's the correct way to back up mon data - if there is any?
>
> Thanks,
> Fabian
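A minimal backup cycle along those lines (paths assume the default
layout and a mon id of "a"; use the matching sysvinit/Upstart command
for your distro):

    service ceph stop mon.a            # stop the mon for a consistent leveldb
    tar czf /backup/mon-a-$(date +%F).tar.gz \
        -C /var/lib/ceph/mon ceph-a
    service ceph start mon.a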
[ceph-users] network Ports Linked to each OSD process
Hi, I am trying to do some network control on the storage nodes. For this, I need to know the ports opened for communication by each OSD process. I learned from http://ceph.com/docs/master/rados/configuration/network-config-ref/ that each OSD process requires 3 ports, and that ports from 6800 upward are reserved for OSD processes. However, when I run ceph osd dump, it lists 4 ports in use for each of the OSDs:

root@cephnode2:~# ceph osd dump | grep osd
max_osd 4
osd.0 up in weight 1 up_from 71 up_thru 71 down_at 68 last_clean_interval [4,70) 10.223.169.166:6800/83380 10.223.169.166:6810/1083380 10.223.169.166:6811/1083380 10.223.169.166:6812/1083380 exists,up fdbbc6eb-7d9f-4ad8-a8c3-caf995422528
osd.1 up in weight 1 up_from 7 up_thru 71 down_at 0 last_clean_interval [0,0) 10.223.169.201:6800/83569 10.223.169.201:6801/83569 10.223.169.201:6802/83569 10.223.169.201:6803/83569 exists,up db545fd7-071f-4671-b1c4-c57221f894a3
osd.2 up in weight 1 up_from 64 up_thru 64 down_at 61 last_clean_interval [12,60) 10.223.169.166:6805/92402 10.223.169.166:6806/92402 10.223.169.166:6807/92402 10.223.169.166:6808/92402 exists,up 594b73b9-1908-4757-b914-d887d850b386
osd.3 up in weight 1 up_from 17 up_thru 71 down_at 0 last_clean_interval [0,0) 10.223.169.201:6805/84590 10.223.169.201:6806/84590 10.223.169.201:6807/84590 10.223.169.201:6808/84590 exists,up 37536050-ef92-4eba-95a7-e7a099c6d059

I also listed the ports the OSD process above (pid 83380) is listening on, using lsof:

root@cephnode2:~/nethogs# lsof -i | grep ceph | grep 83380
ntpd     1627  ntp   19u IPv4   33890 0t0 UDP cephnode2.iind.intel.com:ntp
ceph-osd 83380 root   4u IPv4 4881747 0t0 TCP *:6800 (LISTEN)
ceph-osd 83380 root   5u IPv4 5045544 0t0 TCP cephnode2.iind.intel.com:6810 (LISTEN)
ceph-osd 83380 root   6u IPv4 5045545 0t0 TCP cephnode2.iind.intel.com:6811 (LISTEN)
ceph-osd 83380 root   7u IPv4 5045546 0t0 TCP cephnode2.iind.intel.com:6812 (LISTEN)
ceph-osd 83380 root   8u IPv4 4881751 0t0 TCP *:6804 (LISTEN)
ceph-osd 83380 root  19u IPv4 5101954 0t0 TCP cephnode2.iind.intel.com:6800-computeich.iind.intel.com:60781 (ESTABLISHED)
ceph-osd 83380 root  23u IPv4 5013387 0t0 TCP cephnode2.iind.intel.com:41878-cephnode4.iind.intel.com:6803 (ESTABLISHED)
ceph-osd 83380 root  25u IPv4 5037728 0t0 TCP cephnode2.iind.intel.com:44251-cephnode4.iind.intel.com:6802 (ESTABLISHED)
ceph-osd 83380 root  83u IPv4 5025954 0t0 TCP cephnode2.iind.intel.com:47863-cephnode4.iind.intel.com:6808 (ESTABLISHED)
ceph-osd 83380 root 111u IPv4 4850005 0t0 TCP cephnode2.iind.intel.com:43189-cephnode2.iind.intel.com:6807 (ESTABLISHED)
ceph-osd 83380 root 112u IPv4 4850839 0t0 TCP cephnode2.iind.intel.com:59738-cephnode2.iind.intel.com:6808 (ESTABLISHED)
ceph-osd 83380 root 130u IPv4 5037729 0t0 TCP cephnode2.iind.intel.com:41902-cephnode4.iind.intel.com:6807 (ESTABLISHED)
ceph-osd 83380 root 152u IPv4 5013621 0t0 TCP cephnode2.iind.intel.com:34798-cephmon.iind.intel.com:6789 (ESTABLISHED)
ceph-osd 83380 root 159u IPv4 5040569 0t0 TCP cephnode2.iind.intel.com:6811-cephnode4.iind.intel.com:35321 (ESTABLISHED)
ceph-osd 83380 root 160u IPv4 5040570 0t0 TCP cephnode2.iind.intel.com:6812-cephnode4.iind.intel.com:42682 (ESTABLISHED)
ceph-osd 83380 root 161u IPv4 5043767 0t0 TCP cephnode2.iind.intel.com:6812-cephnode4.iind.intel.com:42683 (ESTABLISHED)
ceph-osd 83380 root 162u IPv4 5038664 0t0 TCP cephnode2.iind.intel.com:6811-cephnode4.iind.intel.com:35324 (ESTABLISHED)

In the list above, the OSD is listening on some additional ports (6810-6812) beyond those listed in the ceph osd dump command. I would like to know if there is a straightforward way of listing the ports used by each OSD process. I would also like to understand the networking architecture of Ceph in more detail - is there any link/doc for that? Thanks in advance, Sharmila
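There is no single ceph command that prints this, but the OSD pids plus a socket listing get close. A sketch, assuming lsof is available; the cmdline parsing to recover the OSD id from the "-i <id>" argument is a guess and may need adjusting for your ceph-osd command line:

#!/bin/sh
# For each running ceph-osd, print its listening TCP ports.
for pid in $(pgrep -x ceph-osd); do
    id=$(tr '\0' ' ' < /proc/$pid/cmdline | sed 's/.*-i *\([0-9]*\).*/\1/')
    ports=$(lsof -nP -a -p $pid -iTCP -sTCP:LISTEN | awk 'NR>1 {sub(/.*:/,"",$9); print $9}' | sort -n | tr '\n' ' ')
    echo "osd.$id (pid $pid) listens on: $ports"
done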
Re: [ceph-users] How to backup mon-data?
On 05/23/2014 04:09 PM, Dan Van Der Ster wrote: Hi, I think you’re rather brave (sorry, foolish) to store the mon data dir in ramfs. One power outage and your cluster is dead. Even with good backups of the data dir I wouldn't want to go through that exercise.

Agreed. Foolish. I'd never do that.

Anyway, to back up the data dirs, you need to stop the mon daemon to get a consistent leveldb before copying the data to a safe place.

I wrote a blog post about this: http://blog.widodh.nl/2014/03/safely-backing-up-your-ceph-monitors/

Wido
-- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
For what it's worth (very little in my case)... Since the cluster wasn't in production yet and Firefly (0.80.1) hit Debian Jessie today, I upgraded it. Big mistake... I did the recommended upgrade song and dance, MONs first, OSDs after that. Then I applied "ceph osd crush tunables default", as per the update instructions and since ceph -s was whining about it. Lastly I did a "ceph osd pool set rbd hashpspool true", and after that finished (people with either a big cluster or a slow network probably should avoid this like the plague) I re-ran the fio run below from a VM again (old or new client libraries made no difference). The result: 2800 write IOPS instead of the 3200 I got with Emperor. So much for improved latency and whatnot... Christian

On Wed, 14 May 2014 21:33:06 +0900 Christian Balzer wrote:

Hello! On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:

Hi Christian, I missed this thread; I haven't been reading the list that well the last few weeks. You already know my setup, since we discussed it in an earlier thread. I don't have a fast backing store, but I see the slow IOPS when doing randwrite inside the VM, with RBD cache. Still running Dumpling here, though.

Nods, I do recall that thread.

A thought struck me that I could test with a pool that consists of OSDs that have tmpfs-based disks; I think I have a bit more latency than your IPoIB, but I've pushed 100k IOPS with the same network devices before. This would verify whether the problem is with the journal disks. I'll also try to run the journal devices in tmpfs as well, as that would test purely Ceph itself.

That would be interesting indeed. Given what I've seen (with the journal at 20% utilization and the actual filestore at around 5%) I'd expect Ceph to be the culprit.

I'll get back to you with the results; hopefully I'll manage to get them done during this night.

Looking forward to that. ^^ Christian

Cheers, Josef

On 13/05/14 11:03, Christian Balzer wrote:

I'm clearly talking to myself, but whatever. For Greg, I've played with all the pertinent journal and filestore options and TCP nodelay; no changes at all. Is there anybody on this ML who's running a Ceph cluster with a fast network and a FAST filestore, i.e. like me with a big HW cache in front of RAID/JBODs, or using SSDs for final storage? If so, what results do you get out of the fio statement below, per OSD? In my case, with 4 OSDs and 3200 IOPS, that's about 800 IOPS per OSD, which is of course vastly faster than the normal individual HDDs could do. So I'm wondering if I'm hitting some inherent limitation of how fast a single OSD (as in the software) can handle IOPS, given that everything else has been ruled out from where I stand. This would also explain why none of the option changes or the use of RBD caching has any measurable effect in the test case below. As in, a slow OSD, aka a single HDD with the journal on the same disk, would clearly benefit from even the small 32MB standard RBD cache, while in my test case the only time the caching becomes noticeable is if I increase the cache size to something larger than the test data size. ^o^ On the other hand, if people here regularly get thousands or tens of thousands of IOPS per OSD with the appropriate HW, I'm stumped. Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: Oh, I didn't notice that. I bet you aren't getting the expected throughput on the RAID array with OSD access patterns, and that's applying back pressure on the journal.
In the "a picture is worth a thousand words" tradition, I give you this iostat -x output taken during a fio run:

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          50.82   0.00   19.43    0.17   0.00  29.58

Device: rrqm/s wrqm/s   r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda       0.00  51.50  0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
sdb       0.00   0.00  0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
sdc       0.00   5.00  0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
sdd       0.00   6.50  0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes; note the nearly complete absence of iowait. sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs. Look at these numbers: the lack of queues, the low wait and service times (this is in ms), plus the overall utilization. The only conclusion I can draw from these numbers and the network results below is that the latency happens
Re: [ceph-users] How to backup mon-data?
Hi,

Am 23.05.2014 um 16:09 schrieb Dan Van Der Ster daniel.vanders...@cern.ch: I think you’re rather brave (sorry, foolish) to store the mon data dir in ramfs. One power outage and your cluster is dead.

I know - I’m still testing my environment and I don’t really plan to use ramfs in prod, but technically it’s quite interesting ;)

Maybe in your case using the cfq IO scheduler would help, since at least then the OSD and MON processes would get fair shares of the disk IOs.

Oh, when did they switch the default scheduler to deadline? Thanks for the hint; I moved to cfq - tests are running.

Anyway, to back up the data dirs, you need to stop the mon daemon to get a consistent leveldb before copying the data to a safe place.

Well, this wouldn’t be a real problem, but I’m wondering how effective it would be. Is it enough to restore such a backup even if data objects have changed in the meantime (since the backup was taken)? I don’t think so :(

To conclude:
* ceph would stop/freeze as soon as the number of mon nodes drops below quorum
* ceph would continue to work as soon as the nodes come up again
* I could create a fresh mon on every node directly on boot by importing the current state (ceph-mon --force-sync --yes-i-really-mean-it ...)

So, as long as there are enough mons to build the quorum, it should work with ramfs. If nodes fail one by one, ceph would stop if quorum is lost and continue once the nodes are back. But if all nodes stop (e.g. a power outage), my ceph cluster is dead, and backups wouldn’t prevent this, would they? Maybe snapshotting the pool could help?

Backup:
* create a snapshot
* shut down one mon
* back up the mon dir

Restore:
* import the mon dir
* create further mons until quorum is restored
* restore the snapshot

Possible?.. :D Thanks, Fabian
Re: [ceph-users] How to backup mon-data?
Hi,

Am 23.05.2014 um 17:31 schrieb Wido den Hollander w...@42on.com: I wrote a blog post about this: http://blog.widodh.nl/2014/03/safely-backing-up-your-ceph-monitors/

So you assume that restoring the old data works - or did you actually prove this? Fabian
[ceph-users] Ceph Day Boston Schedule Released
Hey cephers, Just wanted to let you know that the schedule has been posted for Ceph Day Boston happening on 10 June at the Sheraton Boston, MA: http://www.inktank.com/cephdays/boston/ There are still a couple of talk title tweaks that are pending, but I wanted to get the info out as soon as possible. We have some really solid speakers, including a couple of highly technical talks from the CohortFS guys and a demo of one of the hot new ethernet drives that is poised to take the market by storm. If you haven't signed up yet, please don't wait! We want to make sure we can adequately accommodate everyone that wishes to attend. Thanks, and see you there! Best Regards, Patrick McGarry Director, Community || Inktank http://ceph.com || http://inktank.com @scuttlemonkey || @ceph || @inktank
[ceph-users] centos and 'print continue' support
Yesterday I went through manually configuring a ceph cluster with a rados gateway on CentOS 6.5, and I have a question about the documentation. On this page: https://ceph.com/docs/master/radosgw/config/ it mentions: "On CentOS/RHEL distributions, turn off print continue. If you have it set to true, you may encounter problems with PUT operations." However, when I had 'rgw print continue = false' in my ceph.conf, adding objects with the python boto module would hang at: key.set_contents_from_string('Hello World!') After switching it to 'rgw print continue = true', things started working. I'm wondering if this is because I installed the custom apache/mod_fastcgi packages from the instructions on this page: http://ceph.com/docs/master/install/install-ceph-gateway/#id2 If that's the case, could the docs be updated to mention that setting 'rgw print continue = false' is only needed if you're using the distro packages? Thanks, Bryan
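For reference, a minimal ceph.conf sketch of the setting Bryan is describing - assuming the gateway runs under a section named [client.radosgw.gateway], which is the name used in the ceph docs but may differ in your setup:

[client.radosgw.gateway]
# custom apache/mod_fastcgi build that handles 100-continue:
rgw print continue = true
# stock distro apache/mod_fastcgi (per the docs):
# rgw print continue = false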
Re: [ceph-users] osd pool default pg num problem
The other thing to note, too, is that it appears you're trying to decrease the pg_num/pgp_num parameters, which is not supported. In order to decrease those settings, you'll need to delete and recreate the pools. All new pools created will use the settings defined in the ceph.conf file.

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John Spray
Sent: Friday, May 23, 2014 6:38 AM
To: Cao, Buddy
Cc: ceph-users@lists.ceph.com; ceph-u...@ceph.com
Subject: Re: [ceph-users] osd pool default pg num problem

Those settings are applied when creating new pools with "osd pool create", but not to the pools that are created automatically during cluster setup. We've had the same question before (http://comments.gmane.org/gmane.comp.file-systems.ceph.user/8150), so maybe it's worth opening a ticket to do something about it. Cheers, John

On Fri, May 23, 2014 at 2:01 PM, Cao, Buddy buddy@intel.com wrote: In Firefly, I added the lines below to the [global] section in ceph.conf; however, after creating the cluster, the default pools (metadata/data/rbd) still have a pg num over 900, not 375. Any suggestion?

osd pool default pg num = 375
osd pool default pgp num = 375
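Spelled out, Bradley's delete-and-recreate step for the default rbd pool looks like this (destructive - all data in the pool is lost; the PG counts are just the values from this thread):

# decreasing pg_num is unsupported, so drop and recreate the pool
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
ceph osd pool create rbd 375 375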
[ceph-users] Designing a cluster with ceph and benchmark (ceph vs ext4)
Hi! I have failover clusters for some applications, generally with 2 members configured with Ubuntu + DRBD + Ext4. For example, my IMAP cluster works fine with ~50k email accounts, and my HTTP cluster hosts ~2k sites. See the design here: http://adminlinux.com.br/cluster_design.txt

I would like to provide load balancing instead of just failover, so I would like to use a distributed architecture for the filesystem. As we know, Ext4 isn't a distributed filesystem, so I wish to use Ceph in my clusters. Any suggestions for the design of the cluster with Ubuntu+Ceph?

I built a simple cluster of 2 servers to test simultaneous reading and writing with Ceph. My conf: http://adminlinux.com.br/ceph_conf.txt

But my simultaneous benchmarks hit read and write errors. I ran "iozone -t 5 -r 4k -s 2m" simultaneously on both servers in the cluster. The performance was poor, and I got errors like this:

Error in file: Found 0 Expecting 6d6d6d6d6d6d6d6d addr b660
Error in file: Position 1060864 Record # 259 Record size 4 kb where b660 loop 0

Performance graphs of the benchmark: http://adminlinux.com.br/ceph_bench.html

Can you help me find what I did wrong? Thanks!
-- Thiago Henrique www.adminlinux.com.br
Re: [ceph-users] slow requests
On 5/22/14 11:51, Győrvári Gábor wrote: Hello, I got this kind of log on two nodes of a 3-node cluster; both nodes have 2 OSDs, and only 2 OSDs on two separate nodes are affected - that's why I don't understand the situation. There wasn't any extra IO on the system at the given time. Using radosgw with the S3 API to store objects under ceph; average ops around 20-150, bandwidth usage 100-2000 kB/s read and only 50-1000 kB/s written.

osd_op(client.7821.0:67251068 default.4181.1_products/800x600/537e28022fdcc.jpg [cmpxattr user.rgw.idtag (22) op 1 mode 1,setxattr user.rgw.idtag (33),call refcount.put] 11.fe53a6fb e590) v4 currently waiting for subops from [2]

Are any of your PGs in recovery or backfill? I've seen this happen two different ways.

The first time was because I had the recovery and backfill parameters set too high for my cluster. If your journals aren't SSDs, the default parameters are too high. The recovery operation will use most of the IOps and starve the clients.

The second time I saw this was when one disk was starting to fail. Sectors were failing, and the drive spent a lot of time reading and remapping bad sectors. Consumer-class SATA disks will retry bad sectors for 30+ seconds. It happens in the drive firmware, so it's not something you can stop. Enterprise-class drives give up quicker, since they know you have another copy of the data. (Nobody uses enterprise-class drives stand-alone; they're always in some sort of storage array.) I've had reports of 6+ OSDs blocking subops, and I traced it back to one disk that was blocking others. I replaced that disk, and the warnings went away.

If your cluster is healthy, check the SMART attributes for osd.2. If osd.2 looks good, it might be another osd. Check osd.2's logs, and check any osd that is blocking osd.2. If your cluster is small, it might be faster to just check all disks instead of following the trail.

--
Craig Lewis
Senior Systems Engineer
Central Desktop
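To act on Craig's advice, the disk behind osd.2 can be located from its data dir and checked with smartmontools. A sketch, assuming the default data path (substitute your own mount point, e.g. /mnt/ceph/osdN, if the OSDs were set up by hand):

# find the device backing osd.2, then dump SMART health and attributes
dev=$(findmnt -n -o SOURCE --target /var/lib/ceph/osd/ceph-2)
smartctl -H -A "$dev"   # watch Reallocated_Sector_Ct and Current_Pending_Sector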
Re: [ceph-users] osd pool default pg num problem
If you're not using CephFS, you don't need the metadata or data pools; you can delete them. If you're not using RBD, you don't need the rbd pool.

If you are using CephFS, and you do delete and recreate the metadata/data pools, you'll need to tell CephFS. I think the command is ceph mds add_data_pool new_data_pool_id. I'm not using CephFS, so I can't test that. I don't see any command to set the metadata pool for CephFS, and it seems strange that you have to tell it about the data pool but not the metadata pool.

On 5/23/14 11:22, McNamara, Bradley wrote: The other thing to note, too, is that it appears you're trying to decrease the pg_num/pgp_num parameters, which is not supported. In order to decrease those settings, you'll need to delete and recreate the pools. All new pools created will use the settings defined in the ceph.conf file.

--
Craig Lewis
Senior Systems Engineer
Central Desktop
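Craig's untested suggestion, spelled out as a sketch only - he notes he can't verify it, and whether add_data_pool wants the pool's numeric id or its name may depend on your version:

# recreate a data pool and register it with CephFS
ceph osd pool create data 375 375
ceph osd lspools              # note the numeric id of the new pool
ceph mds add_data_pool 5      # '5' is a placeholder for the id printed above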
Re: [ceph-users] How to backup mon-data?
On 05/23/2014 06:30 PM, Fabian Zimmermann wrote: Am 23.05.2014 um 17:31 schrieb Wido den Hollander w...@42on.com: I wrote a blog post about this: http://blog.widodh.nl/2014/03/safely-backing-up-your-ceph-monitors/ So you assume that restoring the old data works - or did you actually prove this?

No, that won't work in ALL situations. But it's always better to have a backup of your mons than to have none.

-- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] How to backup mon-data?
On 5/23/14 09:30, Fabian Zimmermann wrote: So you assume that restoring the old data works - or did you actually prove this?

I did some of the same things, but never tested a restore (http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/3087). There is a discussion, but I can't figure out how to get gmane to show me the threaded version from a Google search.

I stopped doing the backups because they seemed rather useless. The monitors have a snapshot of the cluster state as it is right now. If you ever need to restore a monitor backup, you're effectively rolling the whole cluster back to that point in time. What happens if you've added disks after the backup? What happens if a disk has failed after the backup? What happens if you write data to the cluster after the backup? What happens if you delete data after the backup and it gets garbage-collected? All questions that can be tested and answered... with a lot of time and experimentation. I decided to add more monitors and stop taking backups.

I'm still thinking about doing manual backups before a major ceph version upgrade. In that case, I'd only need to test the write/delete cases, because I can control the add/remove disk cases. The backups would only be useful between restarting the MON and the OSD processes, though. I can't really back up the OSD state [1], so once the OSDs are upgraded, there's no going back.

1: ZFS or Btrfs snapshots could do this, but neither one is recommended for production. I do plan to make snapshots once either FS is production-ready. LVM snapshots could do it, but they're such a pain that I never bothered - and I still have the scripts I used to make LVM snapshots of MySQL data directories.

--
Craig Lewis
Senior Systems Engineer
Central Desktop
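The LVM snapshot scripts Craig mentions for MySQL would translate to a mon data dir roughly as follows (a sketch; the volume group and LV names are made up, and the mon is stopped briefly around the snapshot so the leveldb inside it is consistent):

#!/bin/sh
# Snapshot the LV holding the mon data dir, archive it, then drop the snapshot.
service ceph stop mon.a
lvcreate --snapshot --size 1G --name mon-snap /dev/vg0/mon
service ceph start mon.a            # mon is down only for the lvcreate itself
mkdir -p /mnt/mon-snap
mount -o ro /dev/vg0/mon-snap /mnt/mon-snap
tar czf /backup/mon-a-$(date +%F).tar.gz -C /mnt/mon-snap .
umount /mnt/mon-snap
lvremove -f /dev/vg0/mon-snap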
Re: [ceph-users] How to backup mon-data?
On 05/23/2014 03:06 PM, Craig Lewis wrote: 1: ZFS or Btrfs snapshots could do this, but neither one is recommended for production.

Out of curiosity, what's the current beef with ZFS? I know what problems are cited for btrfs, but I haven't heard much about ZFS lately.

-- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ceph-users] Radosgw Timeout
On 5/23/14 03:47, Georg Höllrigl wrote:

On 22.05.2014 17:30, Craig Lewis wrote:

On 5/22/14 06:16, Georg Höllrigl wrote: I have created one bucket that holds many small files, separated into different directories. But whenever I try to access the bucket, I only run into some timeout. The timeout is at around 30-100 seconds. This is smaller than the Apache timeout of 300 seconds.

Just so we're all talking about the same things: what does "many small files" mean to you? Also, how are you separating them into directories? Are you just giving files in the same directory the same leading string, like dir1_subdir1_filename?

I can only estimate how many files. At the moment I have 25M files on the origin, but only 1/10th has been synced to radosgw. These are distributed through 20 folders, each containing about 2k directories with ~100-500 files each. Do you think that's too much for this use case?

The recommendations I've seen indicate that 25M objects per bucket is doable, but painful. The bucket is itself an object stored in Ceph, which stores the list of objects in that bucket. With a single bucket containing 25M objects, you're going to hotspot on the bucket.

Think of a bucket like a directory on a filesystem. You wouldn't store 25M files in a single directory. Buckets are a bit simpler than directories: they don't have to track permissions, per-file ACLs, and all the other things that POSIX filesystems do. You can push them harder than a normal directory, but the same concepts still apply. The more files you put in a bucket/directory, the slower it gets. Most filesystems impose a hard limit on the number of files in a directory; RadosGW doesn't have a limit, it just gets slower.

Even the list of buckets has this problem. You wouldn't want to create 25M buckets with one object each. By default, there is a 1000-bucket limit per user, but you can increase that.

If you can handle using 20 buckets, it would be worthwhile to put each one of your top 20 folders into its own bucket. If you can break it apart even more, that would be even better. I mentioned that I have a bunch of buckets with ~1M objects each. GET and PUT of objects is still fast, but listing the contents of a bucket takes a long time - each bucket takes 20-30 minutes to list in full. If you're going to be doing a lot of bucket listing, you might want to keep each bucket below 1000 items. Maybe each of your 2k directories gets its own bucket. If using more than one bucket is difficult, then 25M objects in one bucket will work.

--
Craig Lewis
Senior Systems Engineer
Central Desktop
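Craig notes the default limit of 1000 buckets per user can be raised; a sketch with radosgw-admin (the uid and limit here are placeholder values):

# raise the per-user bucket limit above the default of 1000
radosgw-admin user modify --uid=someuser --max-buckets=5000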
Re: [ceph-users] Questions about zone and disater recovery
On 5/21/14 19:49, wsnote wrote: Hi, everyone! I have 2 ceph clusters: one master zone, the other a secondary zone. Now I have some questions.

1. Can ceph have two or more secondary zones?

It's supposed to work, but I haven't tested it.

2. Can the roles of master zone and secondary zone be swapped? I mean, can I change the secondary zone to be the master and the master zone to be a secondary?

Yes and no. You can promote the slave to a master at any time by disabling replication and writing to it. You'll want to update your region and zone maps, but that's only required to make replication between zones work. Converting the master to a secondary zone... I don't know. Everything will work if you delete the contents of the old master, set it up as a new secondary of the new master, and re-replicate everything. Nobody wants to do that. It would be nice if you could just point the old master (with its existing data) at the new master and have it start replicating. I can't answer that.

3. How do you deal with the situation when the master zone is down? Currently the secondary zone forbids all file operations, such as creating or deleting objects. When the master zone is down, users can't do anything with the files except read objects from the secondary zone. It's a bad user experience; additionally, it will have a bad influence on the confidence of the users. I know the limit on the secondary zone exists for the consistency of the data. However, is there another way to improve the experience? I think there could be a config option that allows file operations on the secondary zone. If the master zone is down, the admin can enable it, and the users can do file operations as usual, with the secondary recording all file operations on the files. When the master zone comes back, the admin can sync the files to the master zone manually.

The secondary zone tracks which metadata operations it has replayed from the master zone. It does this per bucket. In theory, there's no reason you can't have additional buckets in the slave zone that the master zone doesn't have. Since these buckets aren't replicated, there shouldn't be a problem writing to them. In theory, you should even be able to write objects to the existing buckets in the slave, as long as the master doesn't have those objects. I don't know what would happen if you created one of those buckets or objects on the master. Maybe replication breaks, or maybe it just overwrites the data in the slave. That's a lot of "in theory", though. I wouldn't attempt it without a lot of simulation in test clusters.

--
Craig Lewis
Senior Systems Engineer
Central Desktop
Re: [ceph-users] How to backup mon-data?
Hello Dimitri,

Le 23 mai 2014 à 22:33, Dimitri Maziuk dmaz...@bmrb.wisc.edu a écrit : Out of curiosity, what's the current beef with ZFS? I know what problems are cited for btrfs, but I haven't heard much about ZFS lately.

The Linux implementation (ZoL) is actually stable for production, but it is quite memory-hungry because of an spl/slab fragmentation issue ... But I would ask a question: even with a snapshot-capable FS, is a snapshot sufficient for a consistent backup of a running leveldb? Or did you plan to stop/snap/start the mon? (No knowledge at all about leveldb ...)

Cheers
Re: [ceph-users] collectd / graphite / grafana .. calamari?
Hi John. Thanks for the reply - sounds very good. The extra visualizations from kibana look cool (grafana only seems to pack a small subset, but the codebase is basically the same); I'll put some more in soon, as it seems they can still be useful later. Looking forward to some calamari. Cheers, Ricardo

On Fri, May 23, 2014 at 10:42 PM, John Spray john.sp...@inktank.com wrote:

Hi Ricardo, Let me share a few notes on metrics in calamari:

* We're bundling graphite, and using diamond to send home metrics. The diamond collector used in calamari has always been open source [1].
* The Calamari UI has its own graphs page that talks directly to the graphite API (the calamari REST API does not duplicate any of the graphing interface).
* We also bundle the default graphite dashboard, so that folks can go to /graphite/dashboard/ on the calamari server to plot anything custom they want to.

It could be quite interesting to hook in Grafana there in the same way that we currently hook in the default graphite dashboard, as Grafana is definitely nicer and would give us a roadmap to influxdb (a project I am quite excited about). Cheers, John

1. https://github.com/ceph/Diamond/commits/calamari

On Fri, May 23, 2014 at 1:58 AM, Ricardo Rocha rocha.po...@gmail.com wrote:

Hi. I saw the thread a couple of days ago on ceph-users regarding collectd... and yes, I've been working on something similar for the last few days :)

https://github.com/rochaporto/collectd-ceph

It has a set of collectd plugins pushing metrics which mostly map what the ceph commands return. In our setup it pushes them to graphite, and the displays rely on grafana (check for a screenshot in the link above). As it relies on common building blocks, it's easily extensible, and we'll come up with new dashboards soon - things like plotting osd data against the metrics from the collectd disk plugin, which we also deploy.

This email is mostly to share the work, but also to ask about Calamari. I asked Patrick after the RedHat/Inktank news; I have no idea what it provides, but I'm sure it comes with lots of extra sauce - he suggested asking on the list. What's the timeline to have it open sourced? It would be great to have a look at it, and as there's work from different people in this area, maybe we can start working together on some fancier monitoring tools. Regards, Ricardo
Re: [ceph-users] slow requests
Hello, no, I don't see any backfill entries in ceph.log during that period. The drives are WD2000FYYZ-01UL1B1, and I did not find anything suspicious in the SMART data - but yes, I will check the other drives too. Could I somehow determine which PG the file was placed in? Thanks

On 2014.05.23. 20:51, Craig Lewis wrote: If your cluster is healthy, check the SMART attributes for osd.2. If osd.2 looks good, it might be another osd.

--
Győrvári Gábor - Scr34m scr...@frontember.hu
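On locating which PG holds a given object: ceph can compute the mapping directly. A sketch, assuming the radosgw data lives in the default .rgw.buckets pool (the pool name is an assumption; the object name is the one from this thread):

# map an object name to its PG and the OSDs serving it
ceph osd map .rgw.buckets default.4181.1_products/800x600/537e28022fdcc.jpg
# prints the PG id plus the up/acting OSD sets for that object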
Re: [ceph-users] Designing a cluster with ceph and benchmark (ceph vs ext4)
Hello,

On Fri, 23 May 2014 15:41:23 -0300 Listas@Adminlinux wrote: I have failover clusters for some applications, generally with 2 members configured with Ubuntu + DRBD + Ext4. For example, my IMAP cluster works fine with ~50k email accounts and my HTTP cluster hosts ~2k sites.

My mailbox servers are also multiple DRBD-based cluster pairs. For performance in fully redundant storage there isn't anything better (in the OSS, generic-hardware segment at least).

I would like to provide load balancing instead of just failover, so I would like to use a distributed architecture for the filesystem. As we know, Ext4 isn't a distributed filesystem, so I wish to use Ceph in my clusters.

You will find that all cluster/distributed filesystems have severe performance shortcomings when compared to something like Ext4. On top of that, CephFS isn't ready for production, as the MDS isn't HA. A potential middle way might be to use Ceph/RBD volumes formatted with Ext4. That doesn't give you shared access, but it does let you separate storage and compute nodes, so when one compute node becomes busy you can mount that volume from a more powerful compute node instead. That all said, I can't see any way or reason to replace my mailbox DRBD clusters with Ceph in the foreseeable future; to get similar performance/reliability to DRBD I would have to spend 3-4 times the money. Where Ceph/RBD works well is in situations where you can't fit the compute needs into a storage node (as required with DRBD) and where you want to access things from multiple compute nodes, primarily for migration purposes. In short, as shared storage for VMs.

Any suggestions for the design of the cluster with Ubuntu+Ceph? I built a simple cluster of 2 servers to test simultaneous reading and writing with Ceph. My conf: http://adminlinux.com.br/ceph_conf.txt

Again, CephFS isn't ready for production, but other than that I know very little about it, as I don't use it. However, your version of Ceph is severely outdated; you really should be looking at something more recent to rule out that you're experiencing long-fixed bugs. The same goes for your entire setup and kernel. Also, Ceph only starts to perform decently with many OSDs (disks) and with the journals on SSDs instead of on the same disk. Think DRBD's AL with internal metadata, but with MUCH more impact.

Regards,

Christian
-- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/
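Christian's middle way - an RBD volume with a local Ext4 on top - would look roughly like this (a sketch; pool, image name, size, and mount point are placeholders, and it assumes the rbd kernel client is available on the compute node):

# create and map an RBD image, then put a local Ext4 on it
rbd create --size 102400 rbd/mailstore    # 100 GB image in the 'rbd' pool
rbd map rbd/mailstore                     # appears as /dev/rbd/rbd/mailstore
mkfs.ext4 /dev/rbd/rbd/mailstore
mount /dev/rbd/rbd/mailstore /srv/mail
# to move the workload: umount, 'rbd unmap /dev/rbd/rbd/mailstore',
# then map and mount the same image on another node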