[ceph-users] resolved - unusual growth in cluster after replacing journal SSDs
Dear all,

We finally found the reason for the unexpected growth in our cluster. The data was created by a collectd plugin [1] that measures latency by running rados bench once a minute. Since our cluster was stressed out for a while, removing the objects created by rados bench failed. We completely overlooked the log messages that should have given us the hint a lot earlier, e.g.:

Jan 18 23:26:09 ceph1 ceph-osd: 2018-01-18 23:26:09.931638 7f963389f700 0 -- IP:6802/1986 submit_message osd_op_reply(374 benchmark_data_ceph3_31746_object158 [delete] v21240'22867646 uv22867646 ack = 0) v7 remote, IP:0/3091801967, failed lossy con, dropping message 0x7f96672a6680

Over time we "collected" some 1.5TB of benchmark data :(

Furthermore, due to a misunderstanding we had the collectd plugin that runs the benchmarks active on two machines, doubling the stress on the cluster. And finally, we created the benchmark data in our main production pool, which was also a bad idea.

Hope this info will be useful for someone :)

[1] https://github.com/rochaporto/collectd-ceph

Cheers,
-- 
J.Hofmüller

We are all idiots with deadlines. - Mike West

signature.asc Description: This is a digitally signed message part

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
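A minimal sketch of how such leftovers could be found and cleaned up, assuming the default object prefix that rados bench uses (benchmark_data_<host>_<pid>_); the sample listing and pool name are made up for illustration, and the actual delete command is left commented out since it needs a live cluster:

```shell
# Hypothetical sample of what `rados -p <pool> ls` might print;
# in practice, pipe the real command output instead of this variable.
objects='benchmark_data_ceph3_31746_object158
rbd_data.10074b0dc51.0000000000000000
benchmark_data_ceph3_31746_object159'

# rados bench prefixes its objects with benchmark_data_<host>_<pid>_
leftovers=$(printf '%s\n' "$objects" | grep '^benchmark_data_')

# count of leftover benchmark objects found
echo "$leftovers" | wc -l

# To actually delete them (needs a cluster; uncomment deliberately):
# printf '%s\n' "$leftovers" | xargs -n1 rados -p <pool> rm
```

On a stressed cluster the deletes themselves may fail, as the log message above shows, so re-checking the object listing afterwards is worthwhile.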
Re: [ceph-users] unusual growth in cluster after replacing journal SSDs
Hi,

On Thursday, 2017-11-16 at 13:44 +0100, Burkhard Linke wrote:
> > What remains is the growth of used data in the cluster.
> > 
> > I put background information of our cluster and some graphs of
> > different metrics on a wiki page:
> > 
> > https://wiki.mur.at/Dokumentation/CephCluster
> > 
> > Basically we need to reduce the growth in the cluster, but since we
> > are not sure what causes it we don't have an idea.
> 
> Just a wild guess (wiki page is not accessible yet):

Oh damn, sorry! Fixed that. The wiki page is accessible now.

> Are you sure that the journals were created on the new SSD? If the
> journals were created as files in the OSD directory, their size might
> be accounted for in the cluster size report (assuming OSDs are
> reporting their free space, not a sum of all object sizes).

Yes, I am sure. I just checked and all the journal links point to the correct devices. See OSD 5 as an example:

ls -l /var/lib/ceph/osd/ceph-5
total 64
-rw-r--r--   1 root root   481 Mar 30  2017 activate.monmap
-rw-r--r--   1 ceph ceph     3 Mar 30  2017 active
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 ceph_fsid
drwxr-xr-x 342 ceph ceph 12288 Apr  6  2017 current
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 fsid
lrwxrwxrwx   1 root root    58 Oct 17 14:43 journal -> /dev/disk/by-partuuid/f04832e3-2f09-460e-806f-4a6fe7aa1425
-rw-r--r--   1 ceph ceph    37 Oct 25 11:12 journal_uuid
-rw-------   1 ceph ceph    56 Mar 30  2017 keyring
-rw-r--r--   1 ceph ceph    21 Mar 30  2017 magic
-rw-r--r--   1 ceph ceph     6 Mar 30  2017 ready
-rw-r--r--   1 ceph ceph     4 Mar 30  2017 store_version
-rw-r--r--   1 ceph ceph    53 Mar 30  2017 superblock
-rw-r--r--   1 ceph ceph     0 Nov  7 11:45 systemd
-rw-r--r--   1 ceph ceph    10 Mar 30  2017 type
-rw-r--r--   1 ceph ceph     2 Mar 30  2017 whoami

Regards,
-- 
J.Hofmüller

Nisiti - Abie Nathan, 1927-2008
[ceph-users] unusual growth in cluster after replacing journal SSDs
Dear all,

For about a month we have been experiencing something strange in our small cluster. Let me first describe what happened along the way.

On Oct 4th smartmon told us that the journal SSD in one of our two ceph nodes was about to fail. Since getting replacements took way longer than expected, we decided to place the journals on a spare HDD rather than have the SSD fail and leave us in an uncertain state.

On Oct 17th we finally got the replacement SSDs. First we replaced the broken/soon-to-fail SSD and moved the journals from the temporarily used HDD to the new SSD. Then we also replaced the journal SSD on the other ceph node, since it would probably fail sooner rather than later. We performed all operations by setting noout first, then taking down the OSDs, flushing journals, replacing disks, creating new journals and starting the OSDs again. We waited until the cluster was back in HEALTH_OK state before we proceeded to the next node. AFAIR mkjournal crashed once on the second node, so we ran the command again and the journals were created.

The next morning at 6:25 (the time cron.daily jobs run on Debian systems) we registered almost 2000 slow requests. We have had slow requests before, but never more than 900 per day, and even that was rare. Another odd thing we noticed is that the cluster had grown by 50GB overnight! We currently run 12 vservers from ceph images and none of them is really busy; usually used data would grow by 2GB per week or less. Network traffic between our three monitors roughly doubled at the same time and has stayed at that level until now.

We eventually got rid of all the slow requests by removing all but one snapshot per image. We used to take nightly snapshots of all images and keep 14 snapshots per image. Now we take one snapshot per image per night, use export-diff to offload the diff to storage outside of ceph, and remove the nightly snapshot right away. The only snapshot we keep is the one that the diffs are based on.
What remains is the growth of used data in the cluster.

I put background information about our cluster and some graphs of different metrics on a wiki page:

https://wiki.mur.at/Dokumentation/CephCluster

Basically we need to reduce the growth in the cluster, but since we are not sure what causes it, we don't know where to start. So the main question is: what went wrong when we replaced the journal disks? And of course: how can we fix it?

As always, any hint appreciated!

Regards,
-- 
J.Hofmüller

Ich zitiere wie Espenlaub. - https://twitter.com/TheGurkenkaiser/status/463444397678690304
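The nightly routine described above could be sketched roughly like this; the pool name, image name, base snapshot name and backup path are all assumptions, and the rbd calls themselves are commented out because they require a live cluster (only the naming logic runs here):

```shell
# Hypothetical sketch of the nightly export-diff routine described above.
# Assumed names: pool "rbd", base snapshot "base", backup dir /backup.
today=$(date +%Y%m%d)

backup_image() {
    img=$1
    # Take tonight's snapshot, export the delta since the base snapshot,
    # then drop tonight's snapshot again (base stays for the next diff):
    # rbd snap create "rbd/${img}@${today}"
    # rbd export-diff --from-snap base "rbd/${img}@${today}" \
    #     "/backup/${img}-${today}.diff"
    # rbd snap rm "rbd/${img}@${today}"
    echo "/backup/${img}-${today}.diff"   # path the diff would be written to
}

backup_image vm01
```

A diff exported this way can later be replayed onto a copy of the image with rbd import-diff, as long as the base snapshot is present on the target.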
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi,

On Tuesday, 2017-04-18 at 18:34, Peter Maloney wrote:
> The 'slower with every snapshot even after CoW totally flattens it'
> issue I just find easy to test, and I didn't test it on hammer or
> earlier, and others confirmed it, but didn't keep track of the
> versions. Just make an rbd image, map it (probably... but my tests
> were with qemu librbd), do fio randwrite tests with sync and direct
> on the device (no need for a fs, or anything), and then make a few
> snaps and watch it go way slower.
> 
> How about we make this thread a collection of versions then. And I'll
> redo my test on Thursday maybe.

I did some tests now and provide the results and observations here.

This is the fio config file I used:

[global]
ioengine=rbd
clientname=admin
pool=images
rbdname=benchmark
invalidate=0    # mandatory
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32

Results from fio on image 'benchmark' without any snapshots:

rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.16
Starting 1 process
rbd engine: RBD version: 0.1.10
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/3620KB/0KB /s] [0/905/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=14192: Thu Apr 20 13:11:27 2017
  write: io=8192.0MB, bw=1596.2KB/s, iops=399, runt=5252799msec
    slat (usec): min=1, max=6708, avg=173.27, stdev=97.65
    clat (msec): min=9, max=14505, avg=79.97, stdev=456.86
     lat (msec): min=9, max=14505, avg=80.15, stdev=456.86
    clat percentiles (msec):
     |  1.00th=[   26],  5.00th=[   28], 10.00th=[   28], 20.00th=[   30],
     | 30.00th=[   31], 40.00th=[   32], 50.00th=[   33], 60.00th=[   35],
     | 70.00th=[   37], 80.00th=[   39], 90.00th=[   43], 95.00th=[   47],
     | 99.00th=[ 1516], 99.50th=[ 3621], 99.90th=[ 7046], 99.95th=[ 8094],
     | 99.99th=[10159]
    lat (msec) : 10=0.01%, 20=0.29%, 50=96.17%, 100=1.49%, 250=0.31%
    lat (msec) : 500=0.21%, 750=0.15%, 1000=0.14%, 2000=0.38%, >=2000=0.85%
  cpu          : usr=31.95%, sys=58.32%, ctx=5392823, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%,
                 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=2097152/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=8192.0MB, aggrb=1596KB/s, minb=1596KB/s, maxb=1596KB/s, mint=5252799msec, maxt=5252799msec

Disk stats (read/write):
  vdb: ios=6/20, merge=0/29, ticks=76/12168, in_queue=12244, util=0.23%

sudo fio rbd.fio  2023.87s user 3216.33s system 99% cpu 1:27:31.92 total

Now I created three snapshots of image 'benchmark'. The cluster became unresponsive (slow requests started to appear); a new run of fio never got past 0.0%.

I removed all three snapshots. The cluster became responsive again and fio started to work like before (I had left it running during snapshot removal).

Then I created one snapshot of 'benchmark' while fio was running. The cluster became unresponsive after a few minutes; fio got nothing done as soon as the snapshot was made.

I stopped here ;)

Regards,
-- 
J.Hofmüller

mur.sat -- a space art project http://sat.mur.at/
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi,

On Tuesday, 2017-04-18 at 13:02 +0200, mj wrote:
> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:
> > This might have been true for hammer and older versions of ceph.
> > From what I can tell now, every snapshot taken reduces performance
> > of the entire cluster :(
> 
> Really? Can others confirm this? Is this a 'wellknown fact'?
> (unknown only to us, perhaps...)

I have to add some more/new details now. We started removing snapshots for VMs today. We did this VM by VM and waited some time in between while monitoring the cluster. After having removed all snapshots for the third VM, the cluster went back to a 'normal' state again: no more slow requests, and i/o waits for VMs are down to acceptable numbers again (<10% peaks, <5% average).

So, either there is one VM/image that disturbs the entire cluster, or we reached some kind of threshold, or it's something completely different.

As for the well-known fact: Peter Maloney pointed that out in this thread (mail from last Thursday).

Regards,
-- 
J.Hofmüller

http://thesix.mur.at/
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi,

Thanks for all your comments so far.

On Thursday, 2017-04-13 at 16:53 +0200, Lionel Bouton wrote:
> Hi,
> 
> On 13/04/2017 at 10:51, Peter Maloney wrote:
> > Ceph snapshots really slow things down.

I can confirm that now :(

> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
> measurable impact on performance... until we tried to remove them. We
> usually have at least one snapshot per VM image, often 3 or 4.

This might have been true for hammer and older versions of ceph. From what I can tell now, every snapshot taken reduces performance of the entire cluster :(

So it looks like we were too naive in thinking that snapshots of VMs done in ceph could be a viable backup solution. Which brings me to the question: what are others doing for VM backup?

Regards,
-- 
J.Hofmüller

http://thesix.mur.at/
Re: [ceph-users] slow requests and short OSD failures in small cluster
Dear David,

On Wednesday, 2017-04-12 at 13:46, David Turner wrote:
> I can almost guarantee what you're seeing is PG subfolder splitting.

Every day there's something new to learn about ceph ;)

> When the subfolders in a PG get X number of objects, it splits into
> 16 subfolders. Every cluster I manage has blocked requests and OSDs
> that get marked down while this is happening. To stop the OSDs
> getting marked down, I increase the osd_heartbeat_grace until the
> OSDs no longer mark themselves down during this process.

Thanks for the hint. I adjusted the values accordingly and will monitor our cluster. This morning there was no trouble at all, btw. Still wondering what caused yesterday's mayhem ...

Regards,
-- 
J.Hofmüller

Nisiti - Abie Nathan, 1927-2008
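For reference, a tweak like the one David describes would live in ceph.conf on the OSD hosts; the value below is purely illustrative, not one recommended in this thread:

```ini
[osd]
# allow OSDs more time to answer heartbeats while PG subfolders split
# (example value only; the default is 20 seconds -- tune to your cluster)
osd heartbeat grace = 90
```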
[ceph-users] slow requests and short OSD failures in small cluster
Dear all,

We run a small cluster [1] that is used exclusively for virtualisation (kvm/libvirt). Recently we started to run into performance problems (slow requests, failing OSDs) for no *obvious* reason (at least not obvious to us).

We take nightly snapshots of VM images and keep the snapshots for 14 days. Currently we run 8 VMs in the cluster.

At first it looked like the problem was related to snapshotting images of VMs that were up and running (respectively deleting those snapshots after 14 days). So we changed the procedure to first suspend the VM and then snapshot its image(s). Snapshots are made at 4 am. When we removed *all* the old snapshots (the ones taken of running VMs) the cluster suddenly behaved 'normally' again, but after two days of creating snapshots (not deleting any) of suspended VMs, the slow requests started again (although by far not as frequently as before).

This morning we experienced subsequent failures of 4 of our 6 OSDs, e.g.:

osd.2 IPv4:6800/1621 failed (2 reporters from different host after 49.976472 >= grace 46.444312)

This resulted in HEALTH_WARN with up to about 20% of PGs active+undersized+degraded, stale+active+clean or remapped+peering. No OSD failure lasted longer than 4 minutes, and after 15 minutes everything was back to normal again. The noise started at 6:25 am, a time when cron.daily scripts run here.

We have no clue what could have caused this behaviour :( There seems to be no shortage of resources (CPU, RAM, network) that would explain what happened, but maybe we did not look in the right places.
So any hint on where to look/what to look for would be greatly appreciated :)

[1] cluster setup

Three nodes: ceph1, ceph2, ceph3

ceph1 and ceph2:
1x Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz
32 GB RAM
RAID1 for OS
1x Intel 530 Series SSD (120GB) for journals
3x WDC WD2500BUCT-63TWBY0 for OSDs (1TB)
2x Gbit Ethernet bonded (802.3ad) on HP 2920 stack

ceph3 (virtual machine):
1 CPU
4 GB RAM

Software:
Debian GNU/Linux Jessie (8.7)
Kernel 3.16
ceph 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)

Ceph services:
3 monitors: ceph1, ceph2, ceph3
6 OSDs: ceph1 (3), ceph2 (3)

Regards,
-- 
J.Hofmüller

Nisiti - Abie Nathan, 1927-2008
Re: [ceph-users] solved: ceph-deploy mon create-initial fails on Debian/Jessie
Hi all,

Well, after repeating the procedure a few times I once ran ceph-deploy forgetkeys and voilà, that did it.

Sorry for the noise,
-- 
J.Hofmüller

Ein literarisches Meisterwerk ist nur ein Wörterbuch in Unordnung. - Jean Cocteau
[ceph-users] ceph-deploy mon create-initial fails on Debian/Jessie
Hi all,

I am reinstalling our test cluster and ran into problems when running

ceph-deploy mon create-initial

It fails, stating:

[ceph_deploy.gatherkeys][WARNIN] Unable to find /var/lib/ceph/bootstrap-osd/ceph.keyring on ceph1
[ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file: /var/lib/ceph/bootstrap-osd/ceph.keyring on host ceph1

ceph1 is one of the two nodes I use in our test cluster. After running

ceph-deploy install ceph1 ceph2

I notice that the directories under /var/lib/ceph/ stay empty. So there really are no keys, but from what people on IRC said, they should be there.

I basically (as always) followed the instructions from here:

http://docs.ceph.com/docs/v0.94.5/start/quick-ceph-deploy/

using Debian/Jessie (8.2) systems. sudo for my ceph user works fine, as does everything up to the point when I run the above-mentioned creation of the initial monitor.

Did I hit a bug?

Cheers,
-- 
j.hofmüller

mur.sat -- a space art project http://sat.mur.at/
Re: [ceph-users] can't get cluster to become healthy. "stale+undersized+degraded+peered"
Hi Kurt,

On 2015-09-30 at 17:09, Kurt Bauer wrote:
> You have two nodes but repl.size 3 for your test-data pool. With the
> default crushmap this won't work as it tries to replicate on
> different nodes.
> 
> So either change to repl.size 2, or add another node ;-)

Thanks a lot! I did not set anything specific when creating the pool; 3 is the default, as I know now. Setting the size manually to two worked:

ceph osd pool set test-data size 2

and I put that in my config too :)

Regards,
-- 
j.hofmüller

We are all idiots with deadlines. - Mike West
Re: [ceph-users] can't get cluster to become healthy. "stale+undersized+degraded+peered"
Hi,

Some more info:

ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 3.59998 root default
-2 1.7         host ceph1
 0 0.8             osd.0       up      1.0          1.0
 1 0.8             osd.1       up      1.0          1.0
-3 1.7         host ceph2
 2 0.8             osd.2       up      1.0          1.0
 3 0.8             osd.3       up      1.0          1.0

With one pool that contains no objects:

ceph status
    cluster 2d766dc4-0705-46f9-b559-664e49e0da5c
     health HEALTH_WARN
            128 pgs degraded
            128 pgs stuck degraded
            128 pgs stuck unclean
            128 pgs stuck undersized
            128 pgs undersized
     monmap e1: 1 mons at {ceph1=172.16.16.17:6789/0}
            election epoch 2, quorum 0 ceph1
     osdmap e22: 4 osds: 4 up, 4 in
      pgmap v45: 128 pgs, 1 pools, 0 bytes data, 0 objects
            6768 kB used, 3682 GB / 3686 GB avail
                 128 active+undersized+degraded

ceph osd dump
epoch 22
fsid 2d766dc4-0705-46f9-b559-664e49e0da5c
created 2015-09-30 16:09:58.109963
modified 2015-09-30 16:46:00.625417
flags
pool 1 'test-data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 21 flags hashpspool stripe_width 0
max_osd 4
osd.0 up in weight 1 up_from 4 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.17:6800/11953 172.16.16.17:6800/11953 172.16.16.17:6801/11953 PUB.17:6801/11953 exists,up e384b160-d213-40a4-b3f1-a9146aaa41e1
osd.1 up in weight 1 up_from 8 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.17:6802/12839 172.16.16.17:6802/12839 172.16.16.17:6803/12839 PUB.17:6803/12839 exists,up 4c14bda4-3c31-4188-976e-7f59fd717294
osd.2 up in weight 1 up_from 12 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.18:6800/6583 172.16.16.18:6800/6583 172.16.16.18:6801/6583 89.106.208.18:6801/6583 exists,up 3dd88154-63b7-476d-b8c2-8a34483eb358
osd.3 up in weight 1 up_from 17 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.18:6802/7453 172.16.16.18:6802/7453 172.16.16.18:6803/7453 PUB.18:6803/7453 exists,up 1a96aa8d-c13d-4536-b772-b4189e0069ff

After deleting the pool:

ceph status
    cluster 2d766dc4-0705-46f9-b559-664e49e0da5c
     health HEALTH_WARN
            too few PGs per OSD (0 < min 30)
     monmap e1: 1 mons at
            {ceph1=172.16.16.17:6789/0}
            election epoch 2, quorum 0 ceph1
     osdmap e23: 4 osds: 4 up, 4 in
      pgmap v48: 0 pgs, 0 pools, 0 bytes data, 0 objects
            6780 kB used, 3682 GB / 3686 GB avail

ceph osd dump
epoch 23
fsid 2d766dc4-0705-46f9-b559-664e49e0da5c
created 2015-09-30 16:09:58.109963
modified 2015-09-30 16:56:24.678984
flags
max_osd 4
osd.0 up in weight 1 up_from 4 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.17:6800/11953 172.16.16.17:6800/11953 172.16.16.17:6801/11953 PUB.17:6801/11953 exists,up e384b160-d213-40a4-b3f1-a9146aaa41e1
osd.1 up in weight 1 up_from 8 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.17:6802/12839 172.16.16.17:6802/12839 172.16.16.17:6803/12839 89.106.208.17:6803/12839 exists,up 4c14bda4-3c31-4188-976e-7f59fd717294
osd.2 up in weight 1 up_from 12 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.18:6800/6583 172.16.16.18:6800/6583 172.16.16.18:6801/6583 PUB.18:6801/6583 exists,up 3dd88154-63b7-476d-b8c2-8a34483eb358
osd.3 up in weight 1 up_from 17 up_thru 21 down_at 0 last_clean_interval [0,0) PUB.18:6802/7453 172.16.16.18:6802/7453 172.16.16.18:6803/7453 PUB.18:6803/7453 exists,up 1a96aa8d-c13d-4536-b772-b4189e0069ff

Regards,
-- 
j.hofmüller

Gerüchtegenerator http://plagi.at/geruecht
Re: [ceph-users] can't get cluster to become healthy. "stale+undersized+degraded+peered"
Hi,

On 2015-09-17 at 19:02, Stefan Eriksson wrote:
> I purged all nodes and did purgedata as well and restarted; after
> this everything was fine. You are most certainly right: if anyone
> else has this error, reinitializing the cluster might be the fastest
> way forward.

Great that it worked for you; it didn't for me. This is the second installation of ceph on two nodes with 4 osds, and I still oscillate between your original problem (with a default pool from the installation that I cannot explain where it came from) and the "too few PGs per OSD (0 < min 30)" warning when I delete the default pool.

I basically followed the procedure described here [1] and made some modifications to the config before calling 'ceph-deploy install' on my nodes. Here is the config I use (fsid and IPs deleted):

[global]
fsid = ID
mon_initial_members = ceph1
mon_host = private-ip
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = public-network
cluster_network = private-network
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 150
osd_pool_default_pgp_num = 150
osd_crush_chooseleaf_type = 1

[osd]
osd_journal_size = 1

[1] http://docs.ceph.com/docs/master/start/quick-ceph-deploy/
-- 
J.Hofmüller

Ein literarisches Meisterwerk ist nur ein Wörterbuch in Unordnung. - Jean Cocteau
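As an aside, the "too few PGs per OSD" warning can be avoided by sizing pg_num up front. A common rule of thumb (not from this thread) is roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two; a quick sketch with this cluster's numbers (4 OSDs, size 2):

```shell
# Rule-of-thumb pg_num calculation (assumption: ~100 PGs per OSD target).
osds=4
size=2
target_per_osd=100

raw=$(( osds * target_per_osd / size ))   # 200 for this cluster

# round up to the next power of two
pg_num=1
while [ "$pg_num" -lt "$raw" ]; do
    pg_num=$(( pg_num * 2 ))
done

echo "$pg_num"   # suggested osd_pool_default_pg_num
```

This prints 256 for the numbers above. Note that pg_num of an existing pool can be increased later, but never decreased.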
Re: [ceph-users] [sepia] debian jessie repository ?
Hi,

On 2015-09-29 at 15:54, Gregory Farnum wrote:
> Can you create a ceph-deploy ticket at tracker.ceph.com, please? And
> maybe make sure you're running the latest ceph-deploy, but honestly
> I've no idea what it's doing these days or if this is a resolved
> issue. Just file a bug.

The ceph-deploy version installed here is 1.5.28. I installed it according to the docs [1] via apt-get.

FWIW, I managed to get ceph installed on Debian Jessie by doing the following (on each node):

1) install the repository key manually
2) set /etc/apt/sources.list.d/ceph.list to read
   deb http://ceph.com/debian-hammer wheezy main
3) add an entry for wheezy packages in /etc/apt/sources.list
4) set 'adjust_repos = False' in cephdeploy.conf

[1] http://docs.ceph.com/docs/master/start/quick-start-preflight/#ceph-deploy-setup

Regards,
-- 
j.hofmüller

mur.sat -- a space art project http://sat.mur.at/
Re: [ceph-users] [sepia] debian jessie repository ?
Hi,

On 2015-09-25 at 22:23, Udo Lembke wrote:
> you can use this sources-list
> 
> cat /etc/apt/sources.list.d/ceph.list
> deb http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/v0.94.3
> jessie main

The thing is: whatever I write into ceph.list, ceph-deploy just overwrites it with "deb http://ceph.com/debian-hammer/ jessie main", which does not exist :(

Here is what the log says after "ceph-deploy install":

[ceph1][DEBUG ] Err http://ceph.com jessie/main amd64 Packages
[ceph1][DEBUG ]   404  Not Found [IP: 2607:f298:6050:51f3:f816:3eff:fe50:5ec 80]
[ceph1][DEBUG ] Ign http://ceph.com jessie/main Translation-en_US
[ceph1][DEBUG ] Ign http://ceph.com jessie/main Translation-en
[ceph1][WARNIN] W: Failed to fetch http://ceph.com/debian-hammer/dists/jessie/main/binary-amd64/Packages  404  Not Found [IP: 2607:f298:6050:51f3:f816:3eff:fe50:5ec 80]
[ceph1][WARNIN]
[ceph1][WARNIN] E: Some index files failed to download. They have been ignored, or old ones used instead.
[ceph1][ERROR ] RuntimeError: command returned non-zero exit status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes -q update

Advice needed.

Cheers,
-- 
J.Hofmüller

Fakten verschwinden nicht, nur weil eins sie ignoriert. - nach Aldous Huxley
Re: [ceph-users] [sepia] debian jessie repository ?
Hi,

On 2015-09-25 at 22:23, Udo Lembke wrote:
> you can use this sources-list
> 
> cat /etc/apt/sources.list.d/ceph.list
> deb http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/v0.94.3
> jessie main

Thanks! I will test it as soon as I get back to work next week.

Regards,
-- 
j.hofmüller

mur.sat -- a space art project http://sat.mur.at/
Re: [ceph-users] [sepia] debian jessie repository ?
Hi,

On 2015-09-11 at 13:20, Florent B wrote:
> Jessie repository will be available on next Hammer release ;)

And how should I continue installing ceph meanwhile? ceph-deploy new ... overwrites /etc/apt/sources.list.d/ceph.list and hence throws an error :(

Any hint appreciated.

Cheers,
-- 
J.Hofmüller

wash your hands and say your prayers because jesus and germs are everywhere
Re: [ceph-users] new cluster does not reach active+clean
Hi Tyler,

On 2013-10-03 at 13:22, Tyler Brekke wrote:
> You can add this to your ceph conf to distribute by device rather
> than node.
> 
> osd crush chooseleaf type = 0

Great! Thanks for reminding me. I had that in previous setups but forgot it this time.

> This information is also available in the docs :)

I am painfully aware of that ;)

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol
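For completeness, a minimal sketch of where that option lives; it needs to be in ceph.conf before the cluster is created (an existing cluster needs a CRUSH map edit instead), and the [global] placement shown here is the conventional spot:

```ini
[global]
# replicate across OSDs instead of across hosts -- useful for
# single-node or very small test setups only
osd crush chooseleaf type = 0
```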
Re: [ceph-users] trouble adding OSDs - which documentation to use
Dear all,

This is getting weird now ...

On 2013-10-03 at 11:18, Jogi Hofmüller wrote:
> root@ceph-server1:~# service ceph start
> === osd.0 ===
> No filesystem type defined!

This message is generated by /etc/init.d/ceph (OK, most of you know that, I guess), which looks for "osd mkfs type" in ceph.conf. This is where it failed for me before I added these lines to ceph.conf:

[osd]
osd mkfs type = xfs

Now, with the correct devs = /dev/sdaX in the corresponding [osdX] section, everything works.

I have searched the entire documentation for these two parameters and did not find much useful explanation or guidance there.

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol
[ceph-users] new cluster does not reach active+clean
Dear all,

Hope I am not getting on everyone's nerves by now ;)

I just started over and created a new cluster:

one monitor (ceph-mon0)
one osd-server (ceph-rd0)

After activating the two OSDs on ceph-rd0, the cluster reaches the state active+degraded and never becomes healthy. Unfortunately this particular state is not documented here [1]. Some output:

ceph@ceph-admin:~/cl0$ ceph -w
  cluster 6f1dfb78-e917-4286-a8f0-2e389d295e43
   health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
   monmap e1: 1 mons at {ceph-mon0=192.168.122.56:6789/0}, election epoch 2, quorum 0 ceph-mon0
   osdmap e8: 2 osds: 2 up, 2 in
    pgmap v15: 192 pgs: 192 active+degraded; 0 bytes data, 69924 KB used, 6053 MB / 6121 MB avail
   mdsmap e1: 0/0/1 up

2013-10-03 13:09:59.99 osd.0 [INF] pg has no unfound objects

ceph@ceph-admin:~/cl0$ ceph health detail
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
pg 0.3f is stuck unclean since forever, current state active+degraded, last acting [0]
pg 1.3e is stuck unclean since forever, current state active+degraded, last acting [0]
pg 2.3d is stuck unclean since forever, current state active+degraded, last acting [0]
(cut some lines)
pg 1.0 is active+degraded, acting [0]
pg 0.1 is active+degraded, acting [0]
pg 2.2 is active+degraded, acting [0]
pg 1.1 is active+degraded, acting [0]
pg 0.0 is active+degraded, acting [0]

Any idea what went wrong here?

[1] http://eu.ceph.com/docs/wip-3060/ops/manage/failures/osd/

Regards!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol
Re: [ceph-users] trouble adding OSDs - which documentation to use
Hi Wolfgang,

On 2013-10-02 at 09:01, Wolfgang Hennerbichler wrote:
> On 10/01/2013 05:08 PM, Jogi Hofmüller wrote:
> > Is this [1] outdated? If not, why are the links to chef-* not
> > working? Is chef-* still recommended/used?
> 
> I believe this is a matter of taste. I can not say if this is
> outdated, but I prefer not to use chef but only ceph-deploy.

Ah, good. That's what I was thinking somehow.

> Others might have different opinions on that, but I am the
> old-fashioned guy who puts the stuff into his configuration file
> (like bobtail used to be). This works for me (ceph.conf):
> 
> [osd.0]
> host = rd-c2
> devs = /dev/sdb
> 
> [osd.1]
> host = rd-c2
> devs = /dev/sdc
> 
> ...
> 
> On startup ceph mounts the disk to /var/lib/ceph/osd/ceph-[OSD-Number]
> and works.

Actually that is what I expected. The thing is (I just reproduced it) that my OSDs won't start after rebooting the host. For example:

root@ceph-server1:~# service ceph start
=== osd.0 ===
No filesystem type defined!

This is the relevant part of the config:

[osd.0]
host = ceph-server1
devs = /dev/sdb

And now I see that ceph-deploy disk list, run on the admin host, tells me:

ceph@ceph-admin:~/mur-cluster$ ceph-deploy disk list ceph-server1
[ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo
[ceph_deploy.osd][INFO  ] Distro info: Debian 7.1 wheezy
[ceph_deploy.osd][DEBUG ] Listing disks on ceph-server1...
[ceph-server1][INFO  ] Running command: ceph-disk list
[ceph-server1][INFO  ] /dev/sda :
[ceph-server1][INFO  ]  /dev/sda1 ceph data, prepared, unknown cluster b134da22-a3dd-41cb-95c2-fb6a75af8c1f, osd.0, journal /dev/sda2
[ceph-server1][INFO  ]  /dev/sda2 ceph journal, for /dev/sda1
[ceph-server1][INFO  ] /dev/sdb :
[ceph-server1][INFO  ]  /dev/sdb1 ceph data, prepared, unknown cluster b134da22-a3dd-41cb-95c2-fb6a75af8c1f, osd.1, journal /dev/sdb2
[ceph-server1][INFO  ]  /dev/sdb2 ceph journal, for /dev/sdb1

Which completely fries my brain (unknown cluster b134da22-a3dd-41cb-95c2-fb6a75af8c1f) ...
Any hint on what went wrong here? Is the unknown cluster the reason for the unknown filesystem?

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol
Re: [ceph-users] OSD: Newbie question regarding ceph-deploy osd create
Hi Piers,

On 2013-09-27 at 22:59, Piers Dawson-Damer wrote:
> I'm trying to set up my first cluster (have never manually
> bootstrapped a cluster).

I am at about the same stage here ;)

> Is ceph-deploy osd activate/prepare supposed to write specific
> entries for each OSD to the master ceph.conf file, along the lines of
> http://ceph.com/docs/master/rados/configuration/osd-config-ref/ ?

All I can say is that it does not do so. I am still waiting for an answer to a similar question I posed yesterday ...

I'll let you know if I get closer to solving these things ;)

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol
[ceph-users] trouble adding OSDs - which documentation to use
Dear all,

I am back to managing the cluster before starting to use it, even on a test host.

First of all, a question regarding the docs: Is this [1] outdated? If not, why are the links to chef-* not working? Is chef-* still recommended/used?

After adding a new OSD (with ceph-deploy version 1.2.6) and starting the daemon after a reboot of the osd-server, it complains:

root@ceph-server1:~# service ceph start
=== osd.0 ===
No filesystem type defined!

I could not find anything in the docs on how to specify the fs-type. How is mounting the data partition usually done? It works if I mount it via an entry in /etc/fstab (or manually), but then I would have to edit that by hand.

All this is done using ceph "dumpling", installed/deployed according to the getting started info from [2].

[1] http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
[2] http://ceph.com/docs/master/start/quick-ceph-deploy/

Regards!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol
[ceph-users] authentication trouble
Dear all,

I am fairly new to ceph and just in the process of testing it using several virtual machines. Now I tried to create a block device on a client and fumbled with settings for about an hour or two until the command line

rbd --id dovecot create home --size=1024

finally succeeded. The keyring is /etc/ceph/ceph.keyring, and I thought the name [client.dovecot] would be used by rbd.

I would appreciate any hint on how to configure the client.NAME in the config to ease operation.

Regards!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol