Re: [ceph-users] calamari gui
Hello, I might have misunderstood this. I executed this command on the ceph node, like this:

[ceph node]# ceph-deploy calamari connect {Calamari Server}

But it seems this command should be executed on the calamari server instead:

[calamari server]# ceph-deploy calamari connect {Ceph Nodes}

Is this correct? I would just like to confirm this point.

On November 14, 2014 at 5:59:19 PM, idzzy (idez...@gmail.com) wrote:

Hello, I see the following message on the calamari GUI.
--
New Calamari Installation
This appears to be the first time you have started Calamari and there are no clusters currently configured. 3 Ceph servers are connected to Calamari, but no Ceph cluster has been created yet. Please use ceph-deploy to create a cluster; please see the Inktank Ceph Enterprise documentation for more details.
--
As the next step, I executed the command below to add the ceph node to the calamari server, but the message “no calamari-minion repo found” is output.
--
# ceph-deploy calamari connect 10.32.37.44
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.9): /usr/bin/ceph-deploy calamari connect 10.32.37.44
[ceph_deploy][ERROR ] RuntimeError: no calamari-minion repo found
--
(10.32.37.44 is the calamari server)
So I added the repository to ~/.cephdeploy.conf of the ceph node like this:
--
[calamari-minion]
name=ceph repo noarch packages
baseurl=http://ceph.com/rpm-emperor/el6/noarch
#baseurl=http://ceph.com/rpm-emperor/rhel6/x86_64/
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc
--
Then I ran it again, but the following error is output:
--
# ceph-deploy calamari connect 10.32.37.44
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.9): /usr/bin/ceph-deploy calamari connect 10.32.37.44
Warning: Permanently added '10.32.37.44' (RSA) to the list of known hosts.
root@10.32.37.44's password:
[10.32.37.44][DEBUG ] connected to host: 10.32.37.44
[10.32.37.44][DEBUG ] detect platform information from remote host
[ceph_deploy][ERROR ] RuntimeError: ImportError: No module named ceph_deploy
--
How can I proceed to add the ceph node to the calamari server and start using the calamari GUI? Sorry if this is a basic question; any advice would be helpful. Thank you.
— idzzy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
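If it helps, the ceph-deploy documentation describes running the connect step from the admin/Calamari host and pointing it at the Ceph nodes; a minimal sketch (the hostnames and repo URL below are placeholders, not from this thread):

# ~/.cephdeploy.conf on the host where ceph-deploy is run
[calamari-minion]
name=Calamari minion packages
baseurl=http://<your-repo-host>/calamari-minion/el6/noarch
enabled=1
gpgcheck=0

# then connect the Ceph nodes (not the Calamari server itself) as minions
ceph-deploy calamari connect ceph-node1 ceph-node2 ceph-node3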
Re: [ceph-users] How to upgrade ceph from Firefly to Giant on Wheezy smoothly?
thanks a lot

2014-11-16 1:44 GMT+07:00 Alexandre DERUMIER aderum...@odiso.com:

simply change your debian repository to giant

deb http://ceph.com/debian-giant wheezy main

then
apt-get update
apt-get dist-upgrade
on each node

then /etc/init.d/ceph restart mon on each node
then /etc/init.d/ceph restart osd on each node
...

- Original Message -
From: debian Only onlydeb...@gmail.com
To: ceph-users@lists.ceph.com
Sent: Saturday, 15 November 2014 08:10:30
Subject: [ceph-users] How to upgrade ceph from Firefly to Giant on Wheezy smoothly?

Dear all, I have a Ceph Firefly test cluster on Debian Wheezy too, and I want to upgrade it from Firefly to Giant. Could you tell me how to do the upgrade? I saw the release notes below, but I do not know how to proceed; could you give me some guidance?

Upgrade Sequencing
--
* If your existing cluster is running a version older than v0.80.x Firefly, please first upgrade to the latest Firefly release before moving on to Giant. We have not tested upgrades directly from Emperor, Dumpling, or older releases. We *have* tested:
  * Firefly to Giant
  * Dumpling to Firefly to Giant
* Please upgrade daemons in the following order:
  #. Monitors
  #. OSDs
  #. MDSs and/or radosgw
  Note that the relative ordering of OSDs and monitors should not matter, but we primarily tested upgrading monitors first.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
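A quick way to confirm each daemon is actually running the new release after the restarts (a sketch; the monitor id is a placeholder):

# installed package version on the node you are logged into
ceph --version
# running version of every OSD daemon
ceph tell osd.* version
# running version of a monitor, via its admin socket on that monitor's host
ceph daemon mon.<id> version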
Re: [ceph-users] Performance data collection for Ceph
Thanks Dan, If I understand correctly , perf_counters have to run against OSDs ( I mean for every OSD I have to run a check). On Fri, Nov 14, 2014 at 8:44 PM, Dan Ryder (daryder) dary...@cisco.com wrote: Hi, Take a look at the built in perf counters - http://ceph.com/docs/master/dev/perf_counters/. Through this you can get individual daemon performance as well as some cluster level statistics. Other (cluster-level) disk space utilization and pool utilization/performance is available through “ceph df detail”. Hope this helps. Dan Ryder *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *10 minus *Sent:* Friday, November 14, 2014 10:26 AM *To:* ceph-users *Subject:* [ceph-users] Performance data collection for Ceph Hi, I 'm trying to collect performance data for Ceph I 'm looking to run some commands .. on regular intervals. to collect data. Apart from ceph osd perf . Are there other commands one can use. Can I also track how much data is being replicated ? Does Ceph maintain performance counters for individual OSDs ? Something on the lines of zpool iostat . ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
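As a concrete illustration, each daemon exposes its counters through its local admin socket, so a per-OSD check could look like this (run on the node hosting the daemon; osd.0 and mon.a are just example ids):

# list the available counters and their types
ceph daemon osd.0 perf schema
# dump the current values for one OSD
ceph daemon osd.0 perf dump
# the same works for the other daemon types, e.g. a monitor
ceph daemon mon.a perf dump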
[ceph-users] osd crashed while there was no space
hello, everyone:

These days a problem with ceph has troubled me for a long time. I built a cluster with 3 hosts, each host having three OSDs in it. After that I used the command "rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup" to test the write performance of the cluster. When the cluster was near full, no more data could be written to it. Unfortunately, one host then hung up, and a lot of PGs started to migrate to other OSDs. After a while, a lot of OSDs were marked down and out, and my cluster couldn't work any more.

The following is the output of ceph -s:

cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
 health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean; recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons down, quorum 0,2 2,1
 monmap e1: 3 mons at {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election epoch 40, quorum 0,2 2,1
 osdmap e173: 9 osds: 2 up, 2 in
        flags full
 pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
        37541 MB used, 3398 MB / 40940 MB avail
        945/29649 objects degraded (3.187%)
        34 stale+active+degraded+remapped
        176 stale+incomplete
        320 stale+down+peering
        53 active+degraded+remapped
        408 incomplete
        1 active+recovering+degraded
        673 down+peering
        1 stale+active+degraded
        15 remapped+peering
        3 stale+active+recovering+degraded+remapped
        3 active+degraded
        33 remapped+incomplete
        8 active+recovering+degraded+remapped

The following is the output of ceph osd tree:

# id    weight  type name               up/down reweight
-1      9       root default
-3      9         rack unknownrack
-2      3           host 10.0.0.97
0       1             osd.0               down    0
1       1             osd.1               down    0
2       1             osd.2               down    0
-4      3           host 10.0.0.98
3       1             osd.3               down    0
4       1             osd.4               down    0
5       1             osd.5               down    0
-5      3           host 10.0.0.70
6       1             osd.6               up      1
7       1             osd.7               up      1
8       1             osd.8               down    0

The following is part of the output of osd.0.log:

-3 2014-11-14 17:33:02.166022 7fd9dd1ab700 0 filestore(/data/osd/osd.0) error (28) No space left on device not handled on operation 10 (15804.0.13, or op 13, counting from 0)
-2 2014-11-14 17:33:02.216768 7fd9dd1ab700 0 filestore(/data/osd/osd.0) ENOSPC handling not implemented
-1 2014-11-14 17:33:02.216783 7fd9dd1ab700 0 filestore(/data/osd/osd.0) transaction dump:
... ...
0 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fd9dd1ab700 time 2014-11-14 17:33:02.251570
os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x17f8675]
2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x4855) [0x1534c21]
3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x101) [0x152d67d]
4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x57b) [0x152bdc3]
5: (FileStore::OpWQ::_process(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2f) [0x1553c6f]
6: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*, ThreadPool::TPHandle&)+0x37) [0x15625e7]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
9: (Thread::_entry_func(void*)+0x23) [0x1998117]
10: (()+0x79d1) [0x7fd9e92bf9d1]
11: (clone()+0x6d) [0x7fd9e78ca9dd]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.

It seems the error code was ENOSPC (No space left), so why did the osd process exit with an assert at this point? If there was no space left, why did the cluster choose to migrate? Only osd.6 and osd.7 were alive. I tried to restart the other OSDs, but after a while, those OSDs crashed again. And now I can't read the data any more. Is
[ceph-users] OSDs down
Hi all :) , I need some help, I'm in a sad situation: I've physically lost 2 ceph server nodes (out of 5 nodes initially / 5 monitors). So 3 nodes are left: node1, node2, node3.

On my first remaining node, I've updated the crush map to remove every osd running on those 2 lost servers:

ceph osd crush remove osds
ceph auth del osds
ceph osd rm osds
ceph osd remove my2Lostnodes

So the crush map seems to be ok now on node1. "ceph osd tree" on node1 reports that every osd running on node2 is "down 1", while the osds on node3 and node1 are "up 1". Nevertheless, on node3 every ceph * command stays frozen, so I'm not sure the crush map has been updated on node2 and node3. I don't know how to bring the osds on node2 up again. My node2 says it cannot connect to the cluster!

ceph -s on node1 gives me (so still 5 monitors):

cluster 45d9195b-365e-491a-8853-34b46553db94
 health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean; recovery 181055/544038 objects degraded (33.280%); 11/33 in osds are down; noout flag(s) set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew detected on mon.node2
 monmap e1: 5 mons at {node1=172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0}, election epoch 488, quorum 0,1,2 node1,node2,node3
 mdsmap e48: 1/1/1 up {0=node3=up:active}
 osdmap e3852: 33 osds: 22 up, 33 in
        flags noout
 pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects
        2122 GB used, 90051 GB / 92174 GB avail
        181055/544038 objects degraded (33.280%)
        10016 active+degraded
 client io 0 B/s rd, 233 kB/s wr, 22 op/s

Thx for your help!!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.88 released
On 11/14/14 12:54 PM, Robert LeBlanc wrote: Will there be RPMs built for this release? Hi Robert, Since this is a development release, you can find the v0.88 RPMs in the development download area on ceph.com, which is currently at http://ceph.com/rpm-testing/ . (Down the road, we're working on making these URLs a bit easier to find... but for now, that's where you can find v0.88) - Ken ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Performance data collection for Ceph
For OSDs, that is correct. FYI - perf counters are also available for all Ceph daemon types (mon, mds, rgw).

Dan Ryder

From: 10 minus [mailto:t10te...@gmail.com]
Sent: Monday, November 17, 2014 7:25 AM
To: Dan Ryder (daryder)
Cc: ceph-users
Subject: Re: [ceph-users] Performance data collection for Ceph

Thanks Dan, If I understand correctly, perf_counters have to be run against OSDs (I mean for every OSD I have to run a check).

On Fri, Nov 14, 2014 at 8:44 PM, Dan Ryder (daryder) dary...@cisco.com wrote:

Hi,
Take a look at the built in perf counters - http://ceph.com/docs/master/dev/perf_counters/. Through this you can get individual daemon performance as well as some cluster level statistics. Other (cluster-level) disk space utilization and pool utilization/performance is available through “ceph df detail”. Hope this helps.

Dan Ryder

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 10 minus
Sent: Friday, November 14, 2014 10:26 AM
To: ceph-users
Subject: [ceph-users] Performance data collection for Ceph

Hi,
I'm trying to collect performance data for Ceph. I'm looking to run some commands at regular intervals to collect data. Apart from "ceph osd perf", are there other commands one can use? Can I also track how much data is being replicated? Does Ceph maintain performance counters for individual OSDs? Something along the lines of zpool iostat.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier
Hi, Just a follow-up on this issue, we're probably hitting: http://tracker.ceph.com/issues/9285 We had the issue a few weeks ago with replicated SSD pool in front of rotational pool and turned off cache tiering. Yesterday we made a new test and activating cache tiering on a single erasure pool threw the whole ceph cluster performance to the floor (including non cached non erasure coded pools) with frequent slow write in the logs. Removing cache tiering was enough to go back to normal performance. I assume no one use cache tiering on 0.80.7 in production clusters? Sincerely, Laurent Le Sunday 09 November 2014 à 00:24 +0100, Loic Dachary a écrit : On 09/11/2014 00:03, Gregory Farnum wrote: It's all about the disk accesses. What's the slow part when you dump historic and in-progress ops? This is what I see on g1 (6% iowait) root@g1:~# ceph daemon osd.0 dump_ops_in_flight { num_ops: 0, ops: []} root@g1:~# ceph daemon osd.0 dump_ops_in_flight { num_ops: 1, ops: [ { description: osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613), received_at: 2014-11-09 00:14:17.385256, age: 0.538802, duration: 0.011955, type_data: [ waiting for sub ops, { client: client.4407100, tid: 11030174}, [ { time: 2014-11-09 00:14:17.385393, event: waiting_for_osdmap}, { time: 2014-11-09 00:14:17.385563, event: reached_pg}, { time: 2014-11-09 00:14:17.385793, event: started}, { time: 2014-11-09 00:14:17.385807, event: started}, { time: 2014-11-09 00:14:17.385875, event: waiting for subops from 1,10}, { time: 2014-11-09 00:14:17.386201, event: commit_queued_for_journal_write}, { time: 2014-11-09 00:14:17.386336, event: write_thread_in_journal_buffer}, { time: 2014-11-09 00:14:17.396293, event: journaled_completion_queued}, { time: 2014-11-09 00:14:17.396332, event: op_commit}, { time: 2014-11-09 00:14:17.396678, event: op_applied}, { time: 2014-11-09 00:14:17.397211, event: sub_op_commit_rec}]]}]} and it looks ok. When I go to n7 which has 20% iowait, I see a much larger output http://pastebin.com/DPxsaf6z which includes a number of event: waiting_for_osdmap. I'm not sure what to make of this and it would certainly be better if n7 had a lower iowait. Also when I ceph -w I see a new pgmap is created every second which is also not a good sign. 
2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s rd, 2125 kB/s wr, 237 op/s 2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s 2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88 op/s 2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s wr, 130 op/s 2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s 2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s rd, 2762 kB/s wr, 206 op/s Cheers On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary l...@dachary.org mailto:l...@dachary.org wrote: Hi Greg, On 08/11/2014 20:19, Gregory Farnum wrote: When acting as a cache pool it needs to go do a lookup on the base pool for every object it hasn't encountered before. I assume that's why it's slower. (The penalty should not be nearly as high as you're seeing here, but based on the low numbers I imagine you're running everything on an overloaded laptop or something.) It's running on a small cluster that is busy but not to a point that I expect such a difference: # dsh --concurrent-shell --show-machine-names --remoteshellopt=-p -m g1 -m g2 -m g3 -m n7 -m stri dstat -c 10 3 g1: total-cpu-usage g1: usr sys idl wai hiq siq g1: 6 1 88 6 0 0 g2: total-cpu-usage g2: usr sys idl wai hiq siq g2: 4 1
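For reference, removing the cache tier as described above generally follows this sequence (a sketch; "ecpool" and "cachepool" are placeholder names, and the flush/evict step can take a while on 0.80.x):

ceph osd tier cache-mode cachepool forward        # stop taking new writes into the cache
rados -p cachepool cache-flush-evict-all          # flush dirty objects back to the base pool
ceph osd tier remove-overlay ecpool               # stop redirecting client IO through the cache
ceph osd tier remove ecpool cachepool             # detach the cache tier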
Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier
I think I might be running into the same issue. I'm using Giant though. A lot of slow writes. My thoughts went to: the OSD's get too much work to do (commodity hardware), so I'll have to do some performance tuning to limit parallellism a bit. And indeed, limiting the amount of threads for different tasks reduced some of the load, but I keep getting slow writes very often, especially if the load is coming from CephFS (which is the only thing I use a cache tier for). To answer your question: no, it's not yet production, and it's not suited for production currently either. In my case the slow writes keep stacking up, until OSD's commit suicide, and then the recovery process adds even further to the load of the remaining OSD's, causing a chain reaction in which other OSD's also kill themselves. Non-optimal performance could in my case be acceptable for semi-production, but stability is essential. So I hope these issues can be fixed. Kind regards, Erik. On 17-11-14 17:45, Laurent GUERBY wrote: Hi, Just a follow-up on this issue, we're probably hitting: http://tracker.ceph.com/issues/9285 We had the issue a few weeks ago with replicated SSD pool in front of rotational pool and turned off cache tiering. Yesterday we made a new test and activating cache tiering on a single erasure pool threw the whole ceph cluster performance to the floor (including non cached non erasure coded pools) with frequent slow write in the logs. Removing cache tiering was enough to go back to normal performance. I assume no one use cache tiering on 0.80.7 in production clusters? Sincerely, Laurent Le Sunday 09 November 2014 à 00:24 +0100, Loic Dachary a écrit : On 09/11/2014 00:03, Gregory Farnum wrote: It's all about the disk accesses. What's the slow part when you dump historic and in-progress ops? This is what I see on g1 (6% iowait) root@g1:~# ceph daemon osd.0 dump_ops_in_flight { num_ops: 0, ops: []} root@g1:~# ceph daemon osd.0 dump_ops_in_flight { num_ops: 1, ops: [ { description: osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613), received_at: 2014-11-09 00:14:17.385256, age: 0.538802, duration: 0.011955, type_data: [ waiting for sub ops, { client: client.4407100, tid: 11030174}, [ { time: 2014-11-09 00:14:17.385393, event: waiting_for_osdmap}, { time: 2014-11-09 00:14:17.385563, event: reached_pg}, { time: 2014-11-09 00:14:17.385793, event: started}, { time: 2014-11-09 00:14:17.385807, event: started}, { time: 2014-11-09 00:14:17.385875, event: waiting for subops from 1,10}, { time: 2014-11-09 00:14:17.386201, event: commit_queued_for_journal_write}, { time: 2014-11-09 00:14:17.386336, event: write_thread_in_journal_buffer}, { time: 2014-11-09 00:14:17.396293, event: journaled_completion_queued}, { time: 2014-11-09 00:14:17.396332, event: op_commit}, { time: 2014-11-09 00:14:17.396678, event: op_applied}, { time: 2014-11-09 00:14:17.397211, event: sub_op_commit_rec}]]}]} and it looks ok. When I go to n7 which has 20% iowait, I see a much larger output http://pastebin.com/DPxsaf6z which includes a number of event: waiting_for_osdmap. I'm not sure what to make of this and it would certainly be better if n7 had a lower iowait. Also when I ceph -w I see a new pgmap is created every second which is also not a good sign. 
2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s rd, 2125 kB/s wr, 237 op/s 2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s 2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88 op/s 2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s wr, 130 op/s 2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s 2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean; 2580 GB data,
Re: [ceph-users] Creating RGW S3 User using the Admin Ops API
On Sun, Nov 16, 2014 at 10:50 PM, Wido den Hollander w...@42on.com wrote:

On 17-11-14 07:44, Lei Dong wrote:
I think you should send the data (uid & display-name) as arguments. I successfully create users via the Admin Ops API without any problems. To be clear:

PUT /admin/user?format=json&uid=XXX&display-name=

Did you try this with Dumpling (0.67.X) as well?

I just tested it on dumpling, and it worked. One thing that I did see is that you can get a 403 response if the uid was not provided. Maybe this param is getting clobbered somehow?

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
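For reference, such a request looks roughly like this on the wire (host, access key and signature are placeholders; the Admin Ops API expects an S3-style signed request from a user with the appropriate admin caps):

PUT /admin/user?format=json&uid=newuser&display-name=New%20User HTTP/1.1
Host: rgw.example.com
Date: Mon, 17 Nov 2014 12:00:00 GMT
Authorization: AWS {access_key}:{signature}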
[ceph-users] mds cluster degraded
After i rebuilt the OSD’s, the MDS went into the degraded mode and will not recover. [jshah@Lab-cephmon001 ~]$ sudo tail -100f /var/log/ceph/ceph-mds.Lab-cephmon001.log 2014-11-17 17:55:27.855861 7fffef5d3700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=0 pgs=0 cs=0 l=0 c=0x1e02c00).accept peer addr is really X.X.16.114:0/838757053 (socket is X.X.16.114:34672/0) 2014-11-17 17:57:27.855519 7fffef5d3700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=2 pgs=2 cs=1 l=0 c=0x1e02c00).fault with nothing to send, going to standby 2014-11-17 17:58:47.883799 7fffef3d1700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=0 pgs=0 cs=0 l=0 c=0x1e04ba0).accept peer addr is really X.X.16.114:0/26738200 (socket is X.X.16.114:34699/0) 2014-11-17 18:00:47.882484 7fffef3d1700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=2 pgs=2 cs=1 l=0 c=0x1e04ba0).fault with nothing to send, going to standby 2014-11-17 18:01:47.886662 7fffef1cf700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=0 pgs=0 cs=0 l=0 c=0x1e05540).accept peer addr is really X.X.16.114:0/3673954317 (socket is X.X.16.114:34718/0) 2014-11-17 18:03:47.885488 7fffef1cf700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=2 pgs=2 cs=1 l=0 c=0x1e05540).fault with nothing to send, going to standby 2014-11-17 18:04:47.888983 7fffeefcd700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x1e05280).accept peer addr is really X.X.16.114:0/3403131574 (socket is X.X.16.114:34744/0) 2014-11-17 18:06:47.888427 7fffeefcd700 0 -- X.X.16.111:6800/3046050 X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=2 pgs=2 cs=1 l=0 c=0x1e05280).fault with nothing to send, going to standby 2014-11-17 20:02:03.558250 707de700 -1 mds.0.1 *** got signal Terminated *** 2014-11-17 20:02:03.558297 707de700 1 mds.0.1 suicide. wanted down:dne, now up:active 2014-11-17 20:02:56.053339 77fe77a0 0 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 3424727 2014-11-17 20:02:56.121367 730e4700 1 mds.-1.0 handle_mds_map standby 2014-11-17 20:02:56.124343 730e4700 1 mds.0.2 handle_mds_map i am now mds.0.2 2014-11-17 20:02:56.124345 730e4700 1 mds.0.2 handle_mds_map state change up:standby -- up:replay 2014-11-17 20:02:56.124348 730e4700 1 mds.0.2 replay_start 2014-11-17 20:02:56.124359 730e4700 1 mds.0.2 recovery set is 2014-11-17 20:02:56.124362 730e4700 1 mds.0.2 need osdmap epoch 93, have 92 2014-11-17 20:02:56.124363 730e4700 1 mds.0.2 waiting for osdmap 93 (which blacklists prior instance) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSDs down
Firstly, any chance of getting node4 and node5 back up? You can move the disks (monitor and osd) to a new chasis, and bring it back up. As long as it has the same IP as the original node4 and node5, the monitor should join. How much is the clock skewed on node2? I haven't had problems with small skew (~100 ms), but I've seen posts to the mailing list about large skews (minutes) causing quorum and authentication problems. When you say Nevertheless on node3 every ceph * commands stay freezed, do you by chance mean node2 instead of node3? If so, that supports the clock skew being a problem, preventing the commands and the OSDs from authenticating with the monitors. If you really did mean node3, then something strange else going on. On Mon, Nov 17, 2014 at 7:07 AM, NEVEU Stephane stephane.ne...@thalesgroup.com wrote: Hi all J , I need some help, I’m in a sad situation : i’ve lost 2 ceph server nodes physically (5 nodes initialy/ 5 monitors). So 3 nodes left : node1, node2, node3 On my first node leaving, I’ve updated the crush map to remove every osds running on those 2 lost servers : Ceph osd crush remove osds ceph auth del osds ceph osd rm osds ceph osd remove my2Lostnodes So the crush map seems to be ok now on node1. Ceph osd tree on node 1 returns that every osds running on node2 are “down 1” and “up 1” on node 3 and “up 1” on node1. Nevertheless on node3 every ceph * commands stay freezed, so I’m not sure the crush map has been updated on node2 and node3. I don’t know how to set ods on node 2 up again. My node2 says it cannot connect to the cluster ! Ceph –s on node 1 gives me (so still 5 monitors): cluster 45d9195b-365e-491a-8853-34b46553db94 health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean; recovery 181055/544038 objects degraded (33.280%); 11/33 in osds are down; noout flag(s) set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew detected on mon.node2 monmap e1: 5 mons at {node1= 172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0 http://172.23.6.14:6789/0,omcinfcph02d=172.23.6.15:6789/0,omcinfcph61d=172.23.6.11:6789/0,omcinfcph62d=172.23.6.12:6789/0,omcinfcph63d=172.23.6.13:6789/0}, election epoch 488, quorum 0,1,2 node1,node2,node3 mdsmap e48: 1/1/1 up {0=node3=up:active} osdmap e3852: 33 osds: 22 up, 33 in flags noout pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects 2122 GB used, 90051 GB / 92174 GB avail 181055/544038 objects degraded (33.280%) 10016 active+degraded client io 0 B/s rd, 233 kB/s wr, 22 op/s Thx for your help !! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
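To quantify the skew, something along these lines can help (node names are taken from the thread; run from a host that can reach all three monitors):

# compare the clocks on the monitor nodes
for h in node1 node2 node3; do echo -n "$h: "; ssh $h date +%s.%N; done
# on the skewed monitor host, check whether NTP is actually synchronized
ntpq -p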
Re: [ceph-users] osd crashed while there was no space
At this point, it's probably best to delete the pool. I'm assuming the pool only contains benchmark data, and nothing important.

Assuming you can delete the pool: First, figure out the ID of the data pool. You can get that from
ceph osd dump | grep '^pool'

Once you have the number, delete the data pool:
rados rmpool data data --yes-i-really-really-mean-it

That will only free up space on OSDs that are up. You'll need to manually delete some PGs on the OSDs that are 100% full. Go to /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories that start with your data pool ID. You don't need to delete all of them. Once the disk is below 95% full, you should be able to start that OSD. Once it's up, it will finish deleting the pool.

If you can't delete the pool, it is still possible, but it's more work, and you run the risk of losing data if you make a mistake. You need to disable backfilling, then delete some PGs on each OSD that's full. Try to only delete one copy of each PG. If you delete every copy of a PG on all OSDs, then you have lost the data that was in that PG. As before, once you delete enough that the disk is less than 95% full, you can start the OSD. Once you start it, start deleting your benchmark data out of the data pool. Once that's done, you can re-enable backfilling. You may need to scrub or deep-scrub the OSDs you deleted data from to get everything back to normal.

So how did you get the disks 100% full anyway? Ceph normally won't let you do that. Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or osd_failsafe_full_ratio?

On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote: hello, every one: These days a problem of ceph has troubled me for a long time. I build a cluster with 3 hosts and each host has three osds in it. And after that I used the command rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup to test the write performance of the cluster. When the cluster is near full, there couldn't write any data to it. Unfortunately, there was a host hung up, then a lots of PG was going to migrate to other OSDs. After a while, a lots of OSD was marked down and out, my cluster couldn't work any more.
The following is the output of ceph -s: cluster 002c3742-ab04-470f-8a7a-ad0658b547d6 health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean; recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons down, quorum 0,2 2,1 monmap e1: 3 mons at {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election epoch 40, quorum 0,2 2,1 osdmap e173: 9 osds: 2 up, 2 in flags full pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects 37541 MB used, 3398 MB / 40940 MB avail 945/29649 objects degraded (3.187%) 34 stale+active+degraded+remapped 176 stale+incomplete 320 stale+down+peering 53 active+degraded+remapped 408 incomplete 1 active+recovering+degraded 673 down+peering 1 stale+active+degraded 15 remapped+peering 3 stale+active+recovering+degraded+remapped 3 active+degraded 33 remapped+incomplete 8 active+recovering+degraded+remapped The following is the output of ceph osd tree: # idweight type name up/down reweight -1 9 root default -3 9 rack unknownrack -2 3 host 10.0.0.97 0 1 osd.0 down0 1 1 osd.1 down0 2 1 osd.2 down0 -4 3 host 10.0.0.98 3 1 osd.3 down0 4 1 osd.4 down0 5 1 osd.5 down0 -5 3 host 10.0.0.70 6 1 osd.6 up 1 7 1 osd.7 up 1 8 1 osd.8 down0 The following is part of output os osd.0.log -3 2014-11-14 17:33:02.166022 7fd9dd1ab700 0 filestore(/data/osd/osd.0) error (28) No space left on device not handled on operation 10 (15804.0.13, or op 13, counting from 0) -2 2014-11-14 17:33:02.216768 7fd9dd1ab700 0 filestore(/data/osd/osd.0) ENOSPC handling not implemented -1 2014-11-14 17:33:02.216783 7fd9dd1ab700 0 filestore(/data/osd/osd.0) transaction dump: ... ... 0 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1
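Putting the steps above together, a rough sketch of the recovery (pool id 0, osd.3 and the PG directory names are hypothetical; remove as little as possible, and only ever one copy of any PG):

# 1. find the numeric id of the 'data' pool
ceph osd dump | grep '^pool'
# 2. drop the benchmark pool
rados rmpool data data --yes-i-really-really-mean-it
# 3. on a 100% full OSD that will not start, free a little space by hand
cd /var/lib/ceph/osd/ceph-3/current
du -sh 0.*_head | sort -h | tail        # see which PGs of pool 0 are largest
rm -rf 0.1a_head 0.2b_head              # hypothetical PG directories
# 4. restart the OSD; it will finish removing the pool's objects itself
/etc/init.d/ceph start osd.3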
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
I use `dd` to force activity to the disk I want to replace, and watch the activity lights. That only works if your disks aren't 100% busy. If they are, stop the ceph-osd daemon, and see which drive stops having activity. Repeat until you're 100% confident that you're pulling the right drive. On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr wrote: Hi, I’m used to RAID software giving me the failing disks slots, and most often blinking the disks on the disk bays. I recently installed a DELL “6GB HBA SAS” JBOD card, said to be an LSI 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T) . Since this is an LSI, I thought I’d use MegaCli to identify the disks slot, but MegaCli does not see the HBA card. Then I found the LSI “sas2ircu” utility, but again, this one fails at giving me the disk slots (it finds the disks, serials and others, but slot is always 0) Because of this, I’m going to head over to the disk bay and unplug the disk which I think corresponds to the alphabetical order in linux, and see if it’s the correct one…. But even if this is correct this time, it might not be next time. But this makes me wonder : how do you guys, Ceph users, manage your disks if you really have JBOD servers ? I can’t imagine having to guess slots that each time, and I can’t imagine neither creating serial number stickers for every single disk I could have to manage … Is there any specific advice reguarding JBOD cards people should (not) use in their systems ? Any magical way to “blink” a drive in linux ? Thanks regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
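For example, something like this (assuming the suspect disk is /dev/sdc; it only reads, so it is safe on a live OSD):

# generate a steady read stream so the drive's activity LED stays lit
dd if=/dev/sdc of=/dev/null bs=1M count=10000 iflag=direct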
Re: [ceph-users] OSD commits suicide
I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few day, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK. My primary cluster has always been well behaved. It completed the re-format without having any problems. The clusters are nearly identical, the biggest difference being that the secondary had a higher sustained load due to a replication backlog. On Sat, Nov 15, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Thanks for the tip, I applied these configuration settings and it does lower the load during rebuilding a bit. Are there settings like these that also tune Ceph down a bit during regular operations? The slow requests, timeouts and OSD suicides are killing me. If I allow the cluster to regain consciousness and stay idle a bit, it all seems to settle down nicely, but as soon as I apply some load it immediately starts to overstress and complain like crazy. I'm also seeing this behaviour: http://tracker.ceph.com/issues/9844 This was reported by Dmitry Smirnov 26 days ago, but the report has no response yet. Any ideas? In my experience, OSD's are quite unstable in Giant and very easily stressed, causing chain effects, further worsening the issues. It would be nice to know if this is also noticed by other users? Thanks, Erik. On 11/10/2014 08:40 PM, Craig Lewis wrote: Have you tuned any of the recovery or backfill parameters? My ceph.conf has: [osd] osd max backfills = 1 osd recovery max active = 1 osd recovery op priority = 1 Still, if it's running for a few hours, then failing, it sounds like there might be something else at play. OSDs use a lot of RAM during recovery. How much RAM and how many OSDs do you have in these nodes? What does memory usage look like after a fresh restart, and what does it look like when the problems start? Even better if you know what it looks like 5 minutes before the problems start. Is there anything interesting in the kernel logs? OOM killers, or memory deadlocks? On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu mailto:e...@logtenberg.eu wrote: Hi, I have some OSD's that keep committing suicide. My cluster has ~1.3M misplaced objects, and it can't really recover, because OSD's keep failing before recovering finishes. The load on the hosts is quite high, but the cluster currently has no other tasks than just the backfilling/recovering. I attached the logfile from a failed OSD. It shows the suicide, the recent events and also me starting the OSD again after some time. It'll keep running for a couple of hours and then fail again, for the same reason. I noticed a lot of timeouts. Apparently ceph stresses the hosts to the limit with the recovery tasks, so much that they timeout and can't finish that task. I don't understand why. Can I somehow throttle ceph a bit so that it doesn't keep overrunning itself? 
I kinda feel like it should chill out a bit and simply recover one step at a time instead of full force and then fail. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
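For the manual deep-scrub spreading mentioned above, a small loop along these lines can work (a sketch; the awk pattern just picks PG ids out of "ceph pg dump", and the 60-second pacing is arbitrary):

ceph pg dump 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}' |
while read pg; do
    ceph pg deep-scrub "$pg"
    sleep 60    # pace the scrubs so only a few run at once
done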
Re: [ceph-users] OSD commits suicide
On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com wrote:

I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few days, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK.

Would you mind checking the suggestions (my hints, or the hints from the URLs mentioned in http://marc.info/?l=linux-mm&m=141607712831090&w=2) with 64k inodes again? As for me, I am not observing the lock loop after setting min_free_kbytes to half a gigabyte per OSD. Even if your locks have a different nature, it may be worth trying anyway.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
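If you want to try the min_free_kbytes suggestion, a sketch for a host with 8 OSDs (adjust the arithmetic to your own OSD count):

# 8 OSDs * 512 MB = 4 GB reserved for the kernel
sysctl vm.min_free_kbytes=4194304
# persist it across reboots
echo 'vm.min_free_kbytes = 4194304' >> /etc/sysctl.conf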
Re: [ceph-users] Deep scrub parameter tuning
The minimum value for osd_deep_scrub_interval is osd_scrub_min_interval, and it wouldn't be advisable to go that low. I can't find the documentation, but basically Ceph will attempt a scrub sometime between osd_scrub_min_interval and osd_scrub_max_interval. If the PG hasn't been deep-scrubbed in the last osd_deep_scrub_interval seconds, it does a deep-scrub instead. So if you set osd_deep_scrub_interval to osd_scrub_min_interval, you'll never scrub your PGs, you'll only deep-scrub. Obviously, you can lower the two scrub intervals too. As Loïc says, test it well. I find when I'm playing with these values, I use injectargs to find a good value, then persist that value in the ceph.conf. On Fri, Nov 14, 2014 at 3:16 AM, Loic Dachary l...@dachary.org wrote: Hi, On 14/11/2014 12:11, Mallikarjun Biradar wrote: Hi, Default deep scrub interval is once per week, which we can set using osd_deep_scrub_interval parameter. Whether can we reduce it to less than a week or minimum interval is one week? You can reduce it to a shorter period. It is worth testing the impact on disk IO before going to production with shorter intervals though. Cheers -Thanks regards, Mallikarjun Biradar ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
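As an illustration of that inject-then-persist workflow (259200 seconds, i.e. three days, is just an example interval):

# try the new interval at runtime on every OSD
ceph tell osd.* injectargs '--osd_deep_scrub_interval 259200'
# if the I/O impact is acceptable, persist it in ceph.conf under [osd]
[osd]
    osd deep scrub interval = 259200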
Re: [ceph-users] Negative number of objects degraded for extended period of time
Well, after 4 days, this is probably moot. Hopefully it's finished backfilling, and your problem is gone. If not, I believe that if you fix those backfill_toofull, the negative numbers will start approaching zero. I seem to recall that negative degraded is a special case of degraded, but I don't remember exactly, and can't find any references. I have seen it before, and it went away when my cluster became healthy. As long as you still have OSDs completing their backfilling, I'd let it run. If you get to the point that all of the backfills are done, and you're left with only wait_backfill+backfill_toofull, then you can bump osd_backfill_full_ratio, mon_osd_nearfull_ratio, and maybe osd_failsafe_nearfull_ratio. If you do, be careful, and only bump them just enough to let them start backfilling. If you set them to 0.99, bad things will happen. On Thu, Nov 13, 2014 at 7:57 AM, Fred Yang frederic.y...@gmail.com wrote: Hi, The Ceph cluster we are running have few OSDs approaching to 95% 1+ weeks ago so I ran a reweight to balance it out, in the meantime, instructing application to purge data not required. But after large amount of data purge issued from application side(all OSDs' usage dropped below 20%), the cluster fall into this weird state for days, the objects degraded remain negative for more than 7 days, I'm seeing some IOs going on on OSDs consistently, but the number(negative) objects degraded does not change much: 2014-11-13 10:43:07.237292 mon.0 [INF] pgmap v5935301: 44816 pgs: 44713 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 30172 kB/s wr, 58 op/s; -13582/1468299 objects degraded (-0.925%) 2014-11-13 10:43:08.248232 mon.0 [INF] pgmap v5935302: 44816 pgs: 44713 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 26459 kB/s wr, 51 op/s; -13582/1468303 objects degraded (-0.925%) Any idea what might be happening here? It seems active+remapped+wait_backfill+backfill_toofull stuck? osdmap e43029: 36 osds: 36 up, 36 in pgmap v5935658: 44816 pgs, 32 pools, 1488 GB data, 714 kobjects 3017 GB used, 17092 GB / 20109 GB avail -13438/1475773 objects degraded (-0.911%) 44713 active+clean 1 active+backfilling 20 active+remapped+wait_backfill 27 active+remapped+wait_backfill+backfill_toofull 11 active+recovery_wait 33 active+remapped+backfilling 11 active+wait_backfill+backfill_toofull client io 478 B/s rd, 40170 kB/s wr, 80 op/s The cluster is running on v0.72.2, we are planning to upgrade cluster to firefly, but I would like to get the cluster state clean first before the upgrade. Thanks, Fred ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
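If you do end up nudging the thresholds, a cautious sketch (0.87 is an example value just above the 0.85 default for osd_backfill_full_ratio; keep watching fullness as you go):

# raise the backfill-full threshold just enough for backfill to resume
ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.87'
# check which OSDs are close to the nearfull/full thresholds
ceph health detail | grep -i full
# and, on each OSD node, the actual filesystem usage
df -h /var/lib/ceph/osd/ceph-*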
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
Hi, Try looking for file locate in a folder named Slot X where X in the number of the slot, then echoing 1 in the locate file will make the led blink. : # find /sys -name locate |grep Slot /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 01/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 02/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 03/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 04/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 05/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 06/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 07/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 08/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 09/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 10/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 11/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 12/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 13/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 14/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 15/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 16/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 17/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 18/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 19/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 20/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 21/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 22/locate 
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 23/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 24/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 25/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 26/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 27/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 28/locate LSI 9200-8e with a Supermicro JBOD 28 slots, Ubuntu 12.04, 3.13 kernel. Cheers Le 12/11/2014 14:05, SCHAER Frederic a écrit : Hi, I’m used to RAID software giving me the failing disks slots, and most
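Based on the sysfs paths above, blinking and identifying a drive then looks roughly like this (the enclosure path is abbreviated here; Slot 07 is just an example):

# start the locate/identify LED for the disk in Slot 07
echo 1 > "/sys/devices/.../enclosure/6:0:28:0/Slot 07/locate"
# stop it again
echo 0 > "/sys/devices/.../enclosure/6:0:28:0/Slot 07/locate"
# map the slot to its Linux block device name (e.g. sdc), per the follow-up below
ls "/sys/devices/.../enclosure/6:0:28:0/Slot 07/device/block"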
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
Sorry, I forgot to say that in Slot X/device/block you could find the device name, like sdc. Cheers Le 18/11/2014 00:15, Cedric Lemarchand a écrit : Hi, Try looking for file locate in a folder named Slot X where X in the number of the slot, then echoing 1 in the locate file will make the led blink. : # find /sys -name locate |grep Slot /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 01/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 02/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 03/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 04/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 05/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 06/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 07/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 08/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 09/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 10/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 11/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 12/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 13/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 14/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 15/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 16/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 17/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 18/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 19/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 20/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 21/locate 
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 22/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 23/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 24/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 25/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 26/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 27/locate /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot 28/locate
[ceph-users] CephFS unresponsive at scale (2M files,
I’ve got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS. All the OSDs also mount CephFS at /ceph. I’ve got Graphite pointing at a space under /ceph. Over the weekend, I drove almost 2 million metrics, each of which creates a ~3MB file in a hierarchical path, each sending a datapoint into the metric file once a minute. CephFS seemed to handle the writes ok while I was driving load. All files containing each metric are at paths like this: /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp Today, however, with the load generator still running, reading metadata of files (e.g. directory entries and stat(2) info) in the filesystem (presumably MDS-managed data) seems nearly impossible, especially deeper into the tree. For example, in a shell, cd seems to work but ls hangs, seemingly indefinitely. After turning off the load generator and allowing a while for things to settle down, everything seems to behave better. ceph status and ceph health both return good statuses the entire time. During load generation, the ceph-mds process seems pegged at between 100% and 150%, but with load generation turned off, the process has some high variability from near-idle up to similar 100-150% CPU. Hopefully, I’ve missed something in the CephFS tuning. However, I’m looking for direction on figuring out if it is, indeed, a tuning problem or if this behavior is a symptom of the “not ready for production” banner in the documentation. -- Kevin Sumner ke...@sumner.io ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS unresponsive at scale (2M files,
On Mon, 17 Nov 2014, Kevin Sumner wrote:

I've got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS. All the OSDs also mount CephFS at /ceph. I've got Graphite pointing at a space under /ceph. Over the weekend, I drove almost 2 million metrics, each of which creates a ~3MB file in a hierarchical path, each sending a datapoint into the metric file once a minute. CephFS seemed to handle the writes ok while I was driving load. All files containing each metric are at paths like this: /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp Today, however, with the load generator still running, reading metadata of files (e.g. directory entries and stat(2) info) in the filesystem (presumably MDS-managed data) seems nearly impossible, especially deeper into the tree. For example, in a shell, cd seems to work but ls hangs, seemingly indefinitely. After turning off the load generator and allowing a while for things to settle down, everything seems to behave better. ceph status and ceph health both return good statuses the entire time. During load generation, the ceph-mds process seems pegged at between 100% and 150%, but with load generation turned off, the process has some high variability from near-idle up to similar 100-150% CPU. Hopefully, I've missed something in the CephFS tuning. However, I'm looking for direction on figuring out if it is, indeed, a tuning problem or if this behavior is a symptom of the “not ready for production” banner in the documentation.

My first guess is that the MDS cache is just too small and it is thrashing. Try

ceph mds tell 0 injectargs '--mds-cache-size 1000000'

That's 10x bigger than the default, tho be aware that it will eat up 10x as much RAM too. We've also seen the cache behave in a non-optimal way when evicting things, making it thrash more often than it should. I'm hoping we can implement something like MQ instead of our two-level LRU, but it isn't high on the priority list right now.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
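To keep the larger cache across MDS restarts, the same value can also go into ceph.conf on the MDS host (a sketch; a 1000000-inode cache needs correspondingly more RAM):

[mds]
    mds cache size = 1000000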
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
On Mon, 17 Nov 2014 13:31:57 -0800 Craig Lewis wrote: I use `dd` to force activity to the disk I want to replace, and watch the activity lights. That only works if your disks aren't 100% busy. If they are, stop the ceph-osd daemon, and see which drive stops having activity. Repeat until you're 100% confident that you're pulling the right drive. I use smartctl for lighting up the disk, but same diff. JBOD can become a big PITA quickly with large deployments and if you don't have people with sufficient skill doing disk replacements. Also, depending on how a disk died, you might not be able to reclaim the drive ID (sdc for example) without a reboot, making things even more confusing. Some RAID cards in IT/JBOD mode _will_ actually light up the fail LED if a disk fails and/or have tools to blink a specific disk. However, with the latter, the task of matching a disk from the controller's perspective to what Linux enumerated it as is still on you. Ceph might scale up to really large deployments, but you had better have a well-staffed data center to go with that, or deploy it in a non-JBOD fashion. Christian On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr wrote: Hi, I’m used to RAID software giving me the failing disks’ slots, and most often blinking the disks on the disk bays. I recently installed a DELL “6GB HBA SAS” JBOD card, said to be an LSI 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.). Since this is an LSI, I thought I’d use MegaCli to identify the disk slots, but MegaCli does not see the HBA card. Then I found the LSI “sas2ircu” utility, but again, this one fails at giving me the disk slots (it finds the disks, serials and other details, but the slot is always 0). Because of this, I’m going to head over to the disk bay and unplug the disk which I think corresponds to the alphabetical order in Linux, and see if it’s the correct one. But even if this is correct this time, it might not be next time. This makes me wonder: how do you guys, Ceph users, manage your disks if you really have JBOD servers? I can’t imagine having to guess slots like that each time, and I can’t imagine creating serial number stickers for every single disk I might have to manage, either. Is there any specific advice regarding JBOD cards people should (not) use in their systems? Any magical way to “blink” a drive in Linux? Thanks, regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
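For reference, a few ways to go from a suspect /dev name to a physical slot without a RAID stack. Which of these work depends on the HBA, backplane, and installed tools, so treat this as a sketch; /dev/sdc is a placeholder:

# 1) read-only activity so the drive LED blinks (the dd trick mentioned above):
dd if=/dev/sdc of=/dev/null bs=1M iflag=direct

# 2) if the ledmon package is installed and the backplane speaks SES/SGPIO:
ledctl locate=/dev/sdc
ledctl locate_off=/dev/sdc

# 3) map the kernel name to a stable path and serial instead of guessing:
ls -l /dev/disk/by-path/ | grep -w sdc
smartctl -i /dev/sdc | grep -i serial

Matching the serial from smartctl against the label on the drive caddy is slower than a blinking LED, but it works with any controller.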
Re: [ceph-users] CephFS unresponsive at scale (2M files,
On Nov 17, 2014, at 15:52, Sage Weil s...@newdream.net wrote: On Mon, 17 Nov 2014, Kevin Sumner wrote: I've got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS. All the OSDs also mount CephFS at /ceph. I've got Graphite pointing at a space under /ceph. Over the weekend, I drove almost 2 million metrics, each of which creates a ~3MB file at a hierarchical path, with a datapoint written into each metric file once a minute. CephFS seemed to handle the writes ok while I was driving load. The files containing each metric are at paths like this: /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp Today, however, with the load generator still running, reading metadata of files (e.g. directory entries and stat(2) info) in the filesystem (presumably MDS-managed data) seems nearly impossible, especially deeper into the tree. For example, in a shell, cd seems to work but ls hangs, seemingly indefinitely. After turning off the load generator and allowing a while for things to settle down, everything seems to behave better. ceph status and ceph health both return good statuses the entire time. During load generation, the ceph-mds process seems pegged at between 100% and 150% CPU, but with load generation turned off, the process shows high variability, from near-idle up to a similar 100-150% CPU. Hopefully, I've missed something in the CephFS tuning. However, I'm looking for direction on figuring out whether this is, indeed, a tuning problem or whether this behavior is a symptom of the "not ready for production" banner in the documentation. My first guess is that the MDS cache is just too small and it is thrashing. Try ceph mds tell 0 injectargs '--mds-cache-size 1000000' That's 10x bigger than the default, though be aware that it will eat up 10x as much RAM too. We've also seen the cache behave in a non-optimal way when evicting things, making it thrash more often than it should. I'm hoping we can implement something like MQ instead of our two-level LRU, but it isn't high on the priority list right now. sage Thanks! I’ll pursue mds cache size tuning. Is there any guidance on setting the cache and other mds tunables correctly, or is it an adjust-and-test sort of thing? Cursory searching doesn’t return any relevant documentation on ceph.com. I’m plowing through some other list posts now. -- Kevin Sumner ke...@sumner.io ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
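In the absence of formal tuning guidance, the practical approach is adjust-and-observe: watch the MDS perf counters while the load generator runs and see whether the cache keeps expiring hot inodes. A rough sketch, assuming the MDS admin socket is reachable on its host and the daemon id is mds.a (adjust to your deployment):

ceph daemon mds.a perf dump > /tmp/mds-before.json
sleep 60
ceph daemon mds.a perf dump > /tmp/mds-after.json
# the interesting counters sit in the "mds" section: inodes, inodes_expired, caps
diff <(python -m json.tool /tmp/mds-before.json) <(python -m json.tool /tmp/mds-after.json) | grep -Ei 'inode|caps'

If inodes stays pinned at mds_cache_size while inodes_expired climbs quickly, the cache is thrashing and a larger mds-cache-size (or a smaller working set) is the lever to pull.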
Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier
Hello, On Mon, 17 Nov 2014 17:45:54 +0100 Laurent GUERBY wrote: Hi, Just a follow-up on this issue, we're probably hitting: http://tracker.ceph.com/issues/9285 We had the issue a few weeks ago with a replicated SSD pool in front of a rotational pool and turned off cache tiering. Yesterday we made a new test, and activating cache tiering on a single erasure pool threw the whole ceph cluster's performance to the floor (including non-cached, non-erasure-coded pools) with frequent slow writes in the logs. Removing cache tiering was enough to go back to normal performance. Ouch! I assume no one uses cache tiering on 0.80.7 in production clusters? Not me, and now I'm even less inclined to do so, since this particular item is not the first one that puts cache tiers in doubt, but it is certainly the most compelling one. I wonder how much pressure was on that cache tier, though. If I understand the bug report correctly, this should only happen if some object gets evicted before it was fully replicated. So I suppose if the cache pool is sized correctly for the working set in question (which of course is a bugger given a 4MB granularity), things should work. Until you hit the threshold and they don't anymore... Given that this isn't fixed in Giant either, there goes my plan to speed up a cluster with ample space but insufficient IOPS with cache tiering. Christian Sincerely, Laurent On Sunday 09 November 2014 at 00:24 +0100, Loic Dachary wrote: On 09/11/2014 00:03, Gregory Farnum wrote: It's all about the disk accesses. What's the slow part when you dump historic and in-progress ops? This is what I see on g1 (6% iowait) root@g1:~# ceph daemon osd.0 dump_ops_in_flight { num_ops: 0, ops: []} root@g1:~# ceph daemon osd.0 dump_ops_in_flight { num_ops: 1, ops: [ { description: osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613), received_at: 2014-11-09 00:14:17.385256, age: 0.538802, duration: 0.011955, type_data: [ waiting for sub ops, { client: client.4407100, tid: 11030174}, [ { time: 2014-11-09 00:14:17.385393, event: waiting_for_osdmap}, { time: 2014-11-09 00:14:17.385563, event: reached_pg}, { time: 2014-11-09 00:14:17.385793, event: started}, { time: 2014-11-09 00:14:17.385807, event: started}, { time: 2014-11-09 00:14:17.385875, event: waiting for subops from 1,10}, { time: 2014-11-09 00:14:17.386201, event: commit_queued_for_journal_write}, { time: 2014-11-09 00:14:17.386336, event: write_thread_in_journal_buffer}, { time: 2014-11-09 00:14:17.396293, event: journaled_completion_queued}, { time: 2014-11-09 00:14:17.396332, event: op_commit}, { time: 2014-11-09 00:14:17.396678, event: op_applied}, { time: 2014-11-09 00:14:17.397211, event: sub_op_commit_rec}]]}]} and it looks ok. When I go to n7, which has 20% iowait, I see a much larger output http://pastebin.com/DPxsaf6z which includes a number of event: waiting_for_osdmap entries. I'm not sure what to make of this, and it would certainly be better if n7 had a lower iowait. Also, when I run ceph -w I see a new pgmap created every second, which is also not a good sign. 
2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s rd, 2125 kB/s wr, 237 op/s
2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s
2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88 op/s
2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s wr, 130 op/s
2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s
2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s rd, 2762 kB/s wr, 206 op/s
Cheers On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary l...@dachary.org wrote:
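For anyone who still wants to experiment with cache tiering despite the above: if Christian's reading of the ticket is right and the trouble is triggered by objects being evicted under pressure, then giving the cache pool explicit size targets and conservative flush/evict thresholds at least reduces how often that happens. A hedged sketch of the knobs involved; the pool name and values are illustrative, not a recommendation, and none of this fixes the underlying bug:

ceph osd pool set cachepool target_max_bytes 10737418240    # ~10 GB cap for the cache pool
ceph osd pool set cachepool cache_target_dirty_ratio 0.4    # start flushing at 40% dirty
ceph osd pool set cachepool cache_target_full_ratio 0.8     # start evicting at 80% full
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool hit_set_count 1
ceph osd pool set cachepool hit_set_period 3600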
Re: [ceph-users] osd crashed while there was no space
Hi Craig: Your solution did work very well. But if the data is very important, then when removing PG directories from OSDs, a small mistake will result in loss of data. And if the cluster is very large, don't you think deleting data on the disks to get from 100% down to 95% is a tedious and error-prone thing, with so many OSDs, large disks, and so on? So my key question is: if there is no space in the cluster while some OSDs have crashed, why does the cluster still choose to migrate? And during the migration, other OSDs will crash one by one until the cluster cannot work. 2014-11-18 5:28 GMT+08:00 Craig Lewis cle...@centraldesktop.com: At this point, it's probably best to delete the pool. I'm assuming the pool only contains benchmark data, and nothing important. Assuming you can delete the pool: First, figure out the ID of the data pool. You can get that from ceph osd dump | grep '^pool' Once you have the number, delete the data pool: rados rmpool data data --yes-i-really-really-mean-it That will only free up space on OSDs that are up. You'll need to manually delete some PGs on the OSDs that are 100% full. Go to /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories that start with your data pool ID. You don't need to delete all of them. Once the disk is below 95% full, you should be able to start that OSD. Once it's up, it will finish deleting the pool. If you can't delete the pool, recovery is still possible, but it's more work, and you still run the risk of losing data if you make a mistake. You need to disable backfilling, then delete some PGs on each OSD that's full. Try to only delete one copy of each PG. If you delete every copy of a PG on all OSDs, then you have lost the data that was in that PG. As before, once you delete enough that the disk is less than 95% full, you can start the OSD. Once you start it, start deleting your benchmark data out of the data pool. Once that's done, you can re-enable backfilling. You may need to scrub or deep-scrub the OSDs you deleted data from to get everything back to normal. So how did you get the disks 100% full anyway? Ceph normally won't let you do that. Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or osd_failsafe_full_ratio? On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote: Hello, everyone: These days a problem with Ceph has troubled me for a long time. I built a cluster with 3 hosts, and each host has three OSDs in it. After that I used the command rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup to test the write performance of the cluster. When the cluster was nearly full, no more data could be written to it. Unfortunately, a host then hung up, and a lot of PGs started to migrate to other OSDs. After a while, a lot of OSDs were marked down and out, and my cluster couldn't work any more. 
The following is the output of ceph -s:

    cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
     health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean; recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons down, quorum 0,2 2,1
     monmap e1: 3 mons at {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election epoch 40, quorum 0,2 2,1
     osdmap e173: 9 osds: 2 up, 2 in
            flags full
     pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
            37541 MB used, 3398 MB / 40940 MB avail
            945/29649 objects degraded (3.187%)
                  34 stale+active+degraded+remapped
                 176 stale+incomplete
                 320 stale+down+peering
                  53 active+degraded+remapped
                 408 incomplete
                   1 active+recovering+degraded
                 673 down+peering
                   1 stale+active+degraded
                  15 remapped+peering
                   3 stale+active+recovering+degraded+remapped
                   3 active+degraded
                  33 remapped+incomplete
                   8 active+recovering+degraded+remapped

The following is the output of ceph osd tree:

    # id    weight  type name       up/down reweight
    -1      9       root default
    -3      9           rack unknownrack
    -2      3               host 10.0.0.97
    0       1                   osd.0   down    0
    1       1                   osd.1   down    0
    2       1                   osd.2   down    0
    -4      3               host 10.0.0.98
    3       1                   osd.3   down    0
    4       1                   osd.4   down    0
    5       1                   osd.5   down    0
    -5      3               host 10.0.0.70
    6       1                   osd.6   up
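Craig's recovery steps, roughly as a shell session for reference. The OSD id, pool id, and PG name below are placeholders, the directory layout assumes FileStore, and the init invocation depends on your distro, so double-check the pool id before removing anything:

ceph osd dump | grep '^pool'                    # note the id of the "data" pool, e.g. 0
rados rmpool data data --yes-i-really-really-mean-it
ceph osd set nobackfill                         # only needed for the "can't delete the pool" path
cd /var/lib/ceph/osd/ceph-3/current
ls -d 0.*_head | head                           # PG directories for pool 0 look like 0.1a_head
rm -rf 0.1a_head && df -h .                     # remove one copy of a few PGs until usage < 95%
service ceph start osd.3
ceph osd unset nobackfill

# the full-ratio settings Craig asks about can be read back from a running daemon:
ceph daemon osd.3 config show | grep full_ratio
ceph pg dump | grep -E 'full_ratio'             # the pgmap header carries the cluster full/nearfull ratios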
Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier
On Tuesday 18 November 2014 at 10:11 +0900, Christian Balzer wrote: Hello, On Mon, 17 Nov 2014 17:45:54 +0100 Laurent GUERBY wrote: Hi, Just a follow-up on this issue, we're probably hitting: http://tracker.ceph.com/issues/9285 I wonder how much pressure was on that cache tier, though. If I understand the bug report correctly, this should only happen if some object gets evicted before it was fully replicated. So I suppose if the cache pool is sized correctly for the working set in question (which of course is a bugger given a 4MB granularity), things should work. Until you hit the threshold and they don't anymore... Hi, Same experience with a 10 GB size=3 min=2 cache on a 1 TB 4+1 EC pool and a 500 GB size=3 min=2 cache on an 8 TB 3+1 EC pool (5 hosts, 9 rotational disks total). We also noticed that well after we deleted the cache and EC pools we still had frequent slow writes until we restarted some of the slow-write OSDs. Now the slow writes are very rare: a short episode of ~10 seconds every few hours, according to the logs. Let's hope the Ceph developers will fix this bug so that people can give erasure coding more testing; I have added a comment on the ticket. Sincerely, Laurent ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
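Since slow writes lingering after the tier was removed came up: for reference, a sketch of the teardown sequence one would normally expect, assuming firefly-era commands; the pool names are placeholders. Skipping the flush/evict step can leave dirty objects stranded, which might explain slow writes that persist until the OSDs are restarted:

ceph osd tier cache-mode cachepool forward      # stop admitting new objects into the cache
rados -p cachepool cache-flush-evict-all        # flush dirty objects back to the base pool
ceph osd tier remove-overlay ecpool             # detach client traffic from the tier
ceph osd tier remove ecpool cachepool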