[ceph-users] RGW performance test, put 30 thousand objects into one bucket, average latency 3 seconds
hi, everyone

When I test RGW with rest-bench using the command:

$ rest-bench --access-key=ak --secret=sk --bucket=bucket --seconds=360 -t 200 -b 524288 --no-cleanup write

I find that the call RGW makes to the method bucket_prepare_op is very slow. So I looked at 'dump_historic_ops' and saw:

{ "description": "osd_op(client.4211.0:265984 .dir.default.4148.1 [call rgw.bucket_prepare_op] 3.b168f3d0 e37)",
  "received_at": "2014-07-03 11:07:02.465700",
  "age": "308.315230",
  "duration": "3.401743",
  "type_data": [ "commit sent; apply or cleanup",
    { "client": "client.4211", "tid": 265984 },
    [ { "time": "2014-07-03 11:07:02.465852", "event": "waiting_for_osdmap" },
      { "time": "2014-07-03 11:07:02.465875", "event": "queue op_wq" },
      { "time": "2014-07-03 11:07:03.729087", "event": "reached_pg" },
      { "time": "2014-07-03 11:07:03.729120", "event": "started" },
      { "time": "2014-07-03 11:07:03.729126", "event": "started" },
      { "time": "2014-07-03 11:07:03.804366", "event": "waiting for subops from [19,9]" },
      { "time": "2014-07-03 11:07:03.804431", "event": "commit_queued_for_journal_write" },
      { "time": "2014-07-03 11:07:03.804509", "event": "write_thread_in_journal_buffer" },
      { "time": "2014-07-03 11:07:03.934419", "event": "journaled_completion_queued" },
      { "time": "2014-07-03 11:07:05.297282", "event": "sub_op_commit_rec" },
      { "time": "2014-07-03 11:07:05.297319", "event": "sub_op_commit_rec" },
      { "time": "2014-07-03 11:07:05.311217", "event": "op_applied" },
      { "time": "2014-07-03 11:07:05.867384", "event": "op_commit finish lock" },
      { "time": "2014-07-03 11:07:05.867385", "event": "op_commit" },
      { "time": "2014-07-03 11:07:05.867424", "event": "commit_sent" },
      { "time": "2014-07-03 11:07:05.867428", "event": "op_commit finish" },
      { "time": "2014-07-03 11:07:05.867443", "event": "done" } ] ] }

So I see two performance gaps: one from "queue op_wq" to "reached_pg", another from "journaled_completion_queued" to "op_commit". And I must stress that there are very many ops writing to the same bucket object, so how can I reduce the latency?

baijia...@126.com
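(For anyone wanting to pull a trace like the one above: it comes from the OSD admin socket. A minimal sketch, assuming the OSD id is 12 and the default socket path; if your ceph CLI lacks the daemon subcommand, the older --admin-daemon form against /var/run/ceph/ceph-osd.12.asok does the same:)

# keep more (and older) ops in the history ring before dumping it
$ ceph daemon osd.12 config set osd_op_history_size 50
$ ceph daemon osd.12 config set osd_op_history_duration 600
# dump the slowest recent ops, with per-event timestamps, as JSON
$ ceph daemon osd.12 dump_historic_ops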
Re: [ceph-users] Ceph RBD and Backup.
If the RBD filesystem 'belongs' to you, you can do something like this: http://www.wogri.com/linux/ceph-vm-backup/

On Jul 3, 2014, at 7:21 AM, Irek Fasikhov malm...@gmail.com wrote: Hi all. Dear community, how do you back up Ceph RBD? Thanks -- Fasihov Irek (aka Kataklysm). Best regards, Fasikhov Irek Nurgayazovich. Mobile: +79229045757
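(The approach behind that link boils down to snapshot-then-export. A minimal sketch, assuming a pool named rbd and an image named vm-disk, both placeholder names:)

# freeze a consistent point-in-time view of the image
$ rbd snap create rbd/vm-disk@backup-20140703
# copy the snapshot out to cold storage as a flat file
$ rbd export rbd/vm-disk@backup-20140703 /backup/vm-disk-20140703.img
# drop the snapshot once the export has been verified
$ rbd snap rm rbd/vm-disk@backup-20140703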
Re: [ceph-users] Ceph RBD and Backup.
On 03.07.2014 07:21, Irek Fasikhov wrote: Dear community. How do you back up Ceph RBD? We @ gocept are currently in the process of developing backy, a new-style backup tool that works directly with block-level snapshots / diffs. The tool is not quite finished, but it is making rapid progress. It would be great if you'd try it, spot bugs, contribute code etc. Help is appreciated. :-) PyPI page: https://pypi.python.org/pypi/backy/ Pull requests go here: https://bitbucket.org/ctheune/backy Christian Theune c...@gocept.com is the primary contact. HTH Christian -- Dipl.-Inf. Christian Kauhaus · k...@gocept.com · systems administration gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany http://gocept.com · tel +49 345 219401-11 Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
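(For context, the "block-level snapshots / diffs" mentioned above map onto RBD's incremental export. A rough sketch, assuming an image vm-disk with two existing snapshots snap1 and snap2, all names placeholders:)

# write out only the blocks that changed between the two snapshots
$ rbd export-diff --from-snap snap1 rbd/vm-disk@snap2 /backup/vm-disk.snap1-to-snap2.diff
# the diff can later be replayed onto a copy that already has snap1
$ rbd import-diff /backup/vm-disk.snap1-to-snap2.diff rbd/vm-disk-copy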
[ceph-users] release date for 0.80.2
Hi guys, I was wondering if 0.80.2 is coming any time soon? I am planning an upgrade from Emperor and was wondering if I should wait for 0.80.2 to come out if the release date is pretty soon. Otherwise, I will go for 0.80.1. Cheers Andrei
Re: [ceph-users] Mixing CEPH versions on new ceph nodes...
Hi Wido, thanks for the answers - I have mons and OSDs on each host... server1: mon + 2 OSDs, same for server2 and server3. Any proposed upgrade path, or just start with 1 server and move along to the others? Thanks again. Andrija

On 2 July 2014 16:34, Wido den Hollander w...@42on.com wrote: On 07/02/2014 04:08 PM, Andrija Panic wrote: Hi, I have an existing Ceph cluster of 3 nodes, version 0.72.2. I'm in the process of installing Ceph on a 4th node, but now the Ceph version is 0.80.1. Will this make problems running mixed Ceph versions? No, but the recommendation is not to have this running for a very long period. Try to upgrade all nodes to the same version within a reasonable amount of time. I intend to upgrade Ceph on the existing 3 nodes anyway. Recommended steps? Always upgrade the monitors first! Then the OSDs one by one. Thanks -- Andrija Panić -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on -- Andrija Panić
Re: [ceph-users] release date for 0.80.2
On 07/03/2014 10:27 AM, Andrei Mikhailovsky wrote: Hi guys, I was wondering if 0.80.2 is coming any time soon? I am planning an upgrade from Emperor and was wondering if I should wait for 0.80.2 to come out if the release date is pretty soon. Otherwise, I will go for 0.80.1. Cheers Andrei

Why bother? Upgrading from 0.80.1 to .2 is not that much work. Or is there a specific bug in 0.80.1 which you don't want to run into?

-- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] Mixing CEPH versions on new ceph nodes...
On 07/03/2014 10:59 AM, Andrija Panic wrote: Hi Wido, thanks for the answers - I have mons and OSDs on each host... server1: mon + 2 OSDs, same for server2 and server3. Any proposed upgrade path, or just start with 1 server and move along to the others?

Upgrade the packages, but don't restart the daemons yet, then: 1. Restart the mon leader 2. Restart the two other mons 3. Restart all the OSDs one by one. I suggest that you wait for the cluster to become fully healthy again before restarting the next OSD. Wido

-- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
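(Wido's sequence, written out as a rough shell sketch; the package manager call and daemon names vary by distro, and the leader lookup is an assumption about where to find it:)

# on every node first: upgrade the packages, leave the running daemons alone
$ apt-get install ceph            # or the yum equivalent
# find which mon currently leads the quorum, restart that one first
$ ceph quorum_status | grep quorum_leader_name
$ service ceph restart mon        # on the leader, then on the other two mons
# finally the OSDs, one at a time
$ service ceph restart osd.0
$ ceph health                     # wait for HEALTH_OK before the next OSD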
[ceph-users] Pools do not respond
Hi folks, I am following the test installation step by step, and checking some configuration before trying to deploy a production cluster. Now I have a healthy cluster with 3 mons + 4 OSD servers. I have created one pool containing all osd.x, and two more: one for two of the servers and the other for the other two. The general pool works fine (I can create images and mount them on remote machines). But the other two do not work (the commands rados put, or rbd ls pool, hang forever). This is the tree:

[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
# id    weight  type name               up/down reweight
-7      5.4     root 4x1GbFCnlSAS
-3      2.7             host node04
1       2.7                     osd.1   up      1
-4      2.7             host node03
2       2.7                     osd.2   up      1
-6      8.1     root 4x4GbFCnlSAS
-5      5.4             host node01
3       2.7                     osd.3   up      1
4       2.7                     osd.4   up      1
-2      2.7             host node04
0       2.7                     osd.0   up      1
-1      13.5    root default
-2      2.7             host node04
0       2.7                     osd.0   up      1
-3      2.7             host node04
1       2.7                     osd.1   up      1
-4      2.7             host node03
2       2.7                     osd.2   up      1
-5      5.4             host node01
3       2.7                     osd.3   up      1
4       2.7                     osd.4   up      1

And this is the crushmap:

...
root 4x4GbFCnlSAS {
        id -6           # do not change unnecessarily
        alg straw
        hash 0          # rjenkins1
        item node01 weight 5.400
        item node04 weight 2.700
}
root 4x1GbFCnlSAS {
        id -7           # do not change unnecessarily
        alg straw
        hash 0          # rjenkins1
        item node04 weight 2.700
        item node03 weight 2.700
}
# rules
rule 4x4GbFCnlSAS {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take 4x4GbFCnlSAS
        step choose firstn 0 type host
        step emit
}
rule 4x1GbFCnlSAS {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take 4x1GbFCnlSAS
        step choose firstn 0 type host
        step emit
}
..

I of course set the crush_rules:

sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1

but it seems that something is wrong (4x4GbFCnlSAS.pool is a 512MB file):

sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object 4x4GbFCnlSAS.pool   !!HANGS forever!!

From the ceph client the same thing happens:

rbd ls cloud-4x1GbFCnlSAS   !!HANGS forever!!

[root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS 4x1GbFCnlSAS.object
osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' -> pg 3.114ae7a9 (3.29) -> up ([], p-1) acting ([], p-1)

Any idea what I am doing wrong?? Thanks in advance, I

Bertrand Russell: "The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt."
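(A rule like the ones above can be sanity-checked offline before it ever touches the cluster. A minimal sketch, assuming the decompiled map was saved as map.txt:)

# recompile the edited map, then ask crushtool where rule 2 would place 2 replicas
$ crushtool -c map.txt -o map.bin
$ crushtool -i map.bin --test --rule 2 --num-rep 2 --show-mappings
# lines with an empty [] mapping mean the rule failed to place the replicas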
Re: [ceph-users] Some OSD and MDS crash
On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: On 03/07/2014 00:55, Samuel Just wrote: Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked it for osdmaps, while the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? -Joao

Pierre: do you recall how and when that got set?

I am not sure I understand, but if I remember correctly, after the update to Firefly I was in the state HEALTH_WARN crush map has legacy tunables, and I saw "feature set mismatch" in the logs. So, if I remember correctly, I ran "ceph osd crush tunables optimal" for the crush map problem, and I updated my client and server kernels to 3.16rc. Could it be that? Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote: Yeah, divergent osdmaps: 555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none 6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_4E62BB79__none Joao: thoughts? -Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: The files. When I upgraded: ceph-deploy install --stable firefly servers..., then on each server service ceph restart mon, on each server service ceph restart osd, on each server service ceph restart mds. I upgraded from Emperor to Firefly. After repair, remap, replace, etc... I had some PGs stuck in the peering state. I thought why not try version 0.82, it could solve my problem (that was my mistake). So I upgraded from Firefly to 0.82 with: ceph-deploy install --testing servers... Now all programs are at version 0.82. I have 3 mons, 36 OSDs and 3 MDSs. Pierre PS: I also find inc\uosdmap.13258__0_469271DE__none in each meta directory.

On 03/07/2014 00:10, Samuel Just wrote: Also, what version did you upgrade from, and how did you upgrade? -Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote: Ok, in current/meta on osd 20 and osd 23, please attach all files matching ^osdmap.13258.* There should be one such file on each osd. (It should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory; you'll want to use find.) What version of ceph is running on your mons? How many mons do you have? -Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I did it; the log files are available here: https://blondeau.users.greyc.fr/cephlog/debug20/ The OSDs' log files are really big, +/- 80M. After starting osd.20 some other OSDs crash; I went from 31 OSDs up to 16. I noticed that after this the number of down+peering PGs decreased from 367 to 248. Is that normal? Maybe it's temporary, the time it takes the cluster to verify all the PGs? Regards Pierre

On 02/07/2014 19:16, Samuel Just wrote: You should add debug osd = 20, debug filestore = 20, debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how do I do that? With a command like this: ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'? Or by modifying /etc/ceph/ceph.conf? This file is really sparse because I use udev detection. When I have made these changes, do you want the three log files or only osd.20's? Thank you so much for the help. Regards Pierre

On 01/07/2014 23:51, Samuel Just wrote: Can you reproduce with debug osd = 20, debug filestore = 20, debug ms = 1? -Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I attach: - osd.20, one of the OSDs that I found makes other OSDs crash; - osd.23, one of the OSDs which crashes when I start osd.20; - mds, one of my MDSs. I cut the log files because they are too big. Everything is here: https://blondeau.users.greyc.fr/cephlog/ Regards

On 30/06/2014 17:35, Gregory Farnum wrote: What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's
Re: [ceph-users] Mixing CEPH versions on new ceph nodes...
Thanks a lot Wido, will do... Andrija

On 3 July 2014 13:12, Wido den Hollander w...@42on.com wrote: Upgrade the packages, but don't restart the daemons yet, then: 1. Restart the mon leader 2. Restart the two other mons 3. Restart all the OSDs one by one. I suggest that you wait for the cluster to become fully healthy again before restarting the next OSD. Wido

-- Andrija Panić -- http://admintweets.com --
Re: [ceph-users] Mixing CEPH versions on new ceph nodes...
Wido, one final question: since I compiled libvirt 1.2.3 using ceph-devel 0.72, do I need to recompile libvirt again now with ceph-devel 0.80? Perhaps not a smart question, but I need to make sure I don't screw something up... Thanks for your time, Andrija

On 3 July 2014 14:27, Andrija Panic andrija.pa...@gmail.com wrote: Thanks a lot Wido, will do... Andrija

-- Andrija Panić -- http://admintweets.com --
[ceph-users] write performance per disk
Hi, I have a Ceph cluster set up (45 SATA disks, journals on the same disks) and get only 450 MB/s sequential writes (the maximum, playing around with thread counts in rados bench) with a replica count of 2.

That is about ~20 MB/s of writes per disk (which is what I see in atop as well). Theoretically, with replica 2 and journals on the data disks, it should be 45 x 100 MB/s (SATA) / 2 (replicas) / 2 (journal writes), which makes 1125 MB/s. SATA disks in reality do 120 MB/s, so the theoretical figure should be even higher. I would expect between 40-50 MB/s for each SATA disk.

Can somebody confirm that they can reach this speed with a setup with journals on the SATA disks (with journals on SSD, the speed should be 100 MB/s per disk)? Or does Ceph only give about 1/4 of the speed of a disk (and not the 1/2 expected because of the journals)?

My setup is 3 servers with: 2 x 2.6 GHz Xeons, 128 GB RAM, 15 SATA disks for Ceph (and SSDs for the system), 1 x 10 GbE for external traffic, 1 x 10 GbE for OSD traffic.

With reads I can saturate the network, but writes are far away from that. And I would expect to at least saturate the 10 GbE with sequential writes as well.

Thank you
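(One way to check whether client-side parallelism is the limit is to sweep the thread count in rados bench. A minimal sketch, assuming a throwaway pool named bench; the pool name is a placeholder, and whether your rados build has the cleanup subcommand depends on its version, otherwise remove the benchmark_data objects by hand:)

# 60-second sequential-write runs at increasing concurrency, 4 MB objects
$ for t in 16 32 64 128; do rados bench -p bench 60 write -t $t -b 4194304 --no-cleanup; done
# remove the benchmark objects afterwards
$ rados -p bench cleanup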
Re: [ceph-users] Mixing CEPH versions on new ceph nodes...
On 07/03/2014 03:07 PM, Andrija Panic wrote: Wido, one final question: since I compiled libvirt 1.2.3 using ceph-devel 0.72, do I need to recompile libvirt again now with ceph-devel 0.80? Perhaps not a smart question, but I need to make sure I don't screw something up...

No, no need to. The librados API didn't change, in case you are using RBD storage pool support. Otherwise it just talks to Qemu and that talks to librbd/librados. Wido

-- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] Mixing CEPH versions on new ceph nodes...
Thanks again a lot. Andrija

On 3 July 2014 15:20, Wido den Hollander w...@42on.com wrote: No, no need to. The librados API didn't change, in case you are using RBD storage pool support. Otherwise it just talks to Qemu and that talks to librbd/librados. Wido

-- Andrija Panić -- http://admintweets.com --
Re: [ceph-users] write performance per disk
On 07/03/2014 03:11 PM, VELARTIS Philipp Dürhammer wrote: Hi, I have a Ceph cluster set up (45 SATA disks, journals on the disks) and get only 450 MB/s sequential writes (maximum, playing around with threads in rados bench) with replica 2.

How many threads?

Which is about ~20 MB/s of writes per disk (what I see in atop also). Theoretically, with replica 2 and journals on disk, it should be 45 x 100 MB/s (SATA) / 2 (replicas) / 2 (journal writes), which makes 1125 MB/s. SATA disks in reality do 120 MB/s, so the theoretical output should be more. I would expect between 40-50 MB/s for each SATA disk. Can somebody confirm that they can reach this speed with a setup with journals on the SATA disks (with journals on SSD, speed should be 100 MB/s per disk)? Or does Ceph only give about 1/4 of the speed of a disk (and not the 1/2 expected because of journals)?

Did you verify how much each machine is doing? It could be that the data is not distributed evenly and that on a certain machine the drives are doing 50 MB/s.

My setup is 3 servers with: 2 x 2.6 GHz Xeons, 128 GB RAM, 15 SATA disks for Ceph (and SSDs for the system), 1 x 10 GbE for external traffic, 1 x 10 GbE for OSD traffic. With reads I can saturate the network, but writes are far away from that. And I would expect to at least saturate the 10 GbE with sequential writes as well.

Should be possible, but with 3 servers the data distribution might not be optimal, causing lower write performance. I've seen 10 Gbit write performance on multiple clusters without any problems.

Thank you

-- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
[ceph-users] what is the difference between snapshot and clone in theory?
hi, all. What is the difference between a snapshot and a clone, in theory? thanks
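(The distinction shows up directly in the RBD command set: a snapshot is a read-only point-in-time view of an existing image, while a clone is a new writable image layered copy-on-write on top of a protected snapshot. A minimal sketch, assuming a format-2 image named base in pool rbd; all names are placeholders:)

# snapshot: a read-only view of base, frozen at this moment
$ rbd snap create rbd/base@snap1
# a snapshot must be protected before it can be cloned from
$ rbd snap protect rbd/base@snap1
# clone: a new writable image whose unwritten blocks still come from snap1
$ rbd clone rbd/base@snap1 rbd/child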
Re: [ceph-users] write performance per disk
Hi,

Ceph.conf:
        osd journal size = 15360
        rbd cache = true
        rbd cache size = 2147483648
        rbd cache max dirty = 1073741824
        rbd cache max dirty age = 100
        osd recovery max active = 1
        osd max backfills = 1
        osd mkfs options xfs = -f -i size=2048
        osd mount options xfs = rw,noatime,nobarrier,logbsize=256k,logbufs=8,inode64,allocsize=4M
        osd op threads = 8

So it should be 8 threads? All 3 machines have more or less the same disk load at the same time. Also the disks:

sdb   35.56    87.10   6849.09    617310   48540806
sdc   26.75    72.62   5148.58    514701   36488992
sdd   35.15    53.48   6802.57    378993   48211141
sde   31.04    79.04   6208.48    560141   44000710
sdf   32.79    38.35   6238.28    271805   44211891
sdg   31.67    77.84   5987.45    551680   42434167
sdh   32.95    51.29   6315.76    363533   44761001
sdi   31.67    56.93   5956.29    403478   42213336
sdj   35.83    77.82   6929.31    551501   49109354
sdk   36.86    73.84   7291.00    523345   51672704
sdl   36.02   112.90   7040.47    800177   49897132
sdm   33.25    38.02   6455.05    269446   45748178
sdn   33.52    39.10   6645.19    277101   47095696
sdo   33.26    46.22   6388.20    327541   45274394
sdp   33.38    74.12   6480.62    525325   45929369

The question is: is it poor performance to get at most 500 MB/s of writes with 45 disks and replica 2, or should I expect this?
[ceph-users] why lock the whole OSD handler thread
When I look at the function OSD::OpWQ::_process, I find that the PG lock locks the whole function. So when I use multiple threads to write to the same object, must they be serialized all the way from the OSD handler thread to the journal write thread?

baijia...@126.com
[ceph-users] Multipart upload on ceph 0.8 doesn't work?
Hi, I'm trying to make multipart upload work. I'm using ceph 0.80-702-g9bac31b (from ceph's github). I've tried the code provided by Mark Kirkwood here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/034940.html

But unfortunately, it gives me the error:

(multitest)pszablow@pat-desktop:~/$ python boto_multi.py
begin upload of abc.yuv size 746496, 7 parts
Traceback (most recent call last):
  File "boto_multi.py", line 36, in <module>
    part = bucket.initiate_multipart_upload(objname)
  File "/home/pszablow/venvs/multitest/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 1742, in initiate_multipart_upload
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code></Error>

The single-part upload works for me. I am able to create buckets and objects. I've also tried other similar examples, but none of them works. Any ideas what's wrong? Does ceph's multipart upload actually work for anybody?

Thanks, Patrycja Szabłowska
Re: [ceph-users] Multipart upload on ceph 0.8 doesn't work?
I was looking at this issue this morning. It seems radosgw requires you to have a pool named '' (the empty string) to work with multipart. I just created a pool with that name:

rados mkpool ''

Either that, or allow the pool to be created by the radosgw...

On 3 July 2014 16:27, Patrycja Szabłowska szablowska.patry...@gmail.com wrote: ... Does ceph's multipart upload actually work for anybody?

-- Luis Periquito Unix Engineer Ocado.com http://www.ocado.com/ Head Office, Titan Court, 3 Bishop Square, Hatfield Business Park, Hatfield, Herts AL10 9NE
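(For reference, the boto 2 calls being exercised look roughly like this once the 403 is resolved; a minimal sketch in which the host, credentials and file name are placeholders, using the plain boto 2 API rather than anything RGW-specific:)

import math
import os
import boto
import boto.s3.connection

# plain boto 2 connection against a radosgw endpoint (names are placeholders)
conn = boto.connect_s3(
    aws_access_key_id='ak', aws_secret_access_key='sk',
    host='rgw.example.com', is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())
bucket = conn.get_bucket('bucket')

part_size = 5 * 1024 * 1024          # S3 minimum part size (the last part may be smaller)
mp = bucket.initiate_multipart_upload('abc.yuv')
with open('abc.yuv', 'rb') as fp:
    size = os.fstat(fp.fileno()).st_size
    for i in range(int(math.ceil(size / float(part_size)))):
        # boto reads 'size' bytes from the current file position for each part
        mp.upload_part_from_file(fp, part_num=i + 1,
                                 size=min(part_size, size - i * part_size))
mp.complete_upload()                 # stitches the parts into one object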
Re: [ceph-users] Some OSD and MDS crash
On 03/07/2014 13:49, Joao Eduardo Luis wrote: Are there logs for the monitors for the timeframe this may have happened in?

Which exact timeframe do you want? I have 7 days of logs, so I should have information about the upgrade from Firefly to 0.82. Which mon's logs do you want? All three?

Regards Pierre
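(To pin down which side diverged, the monitors' copy of the same epoch can be fetched and compared against the OSDs' on-disk copies. A rough sketch, reusing epoch 13258 from above; the paths are placeholders, and this assumes the mons can still serve that epoch:)

# ask the monitors for their version of osdmap epoch 13258
$ ceph osd getmap 13258 -o /tmp/osdmap.13258.mon
# checksum it against the copies taken from the OSDs' current/meta directories
$ md5sum /tmp/osdmap.13258.mon osd-20_osdmap.13258__0_4E62BB79__none osd-23_osdmap.13258__0_4E62BB79__none
# and diff the decompiled crush maps, as Sam did
$ osdmaptool --export-crush /tmp/crush.mon /tmp/osdmap.13258.mon
$ crushtool -d /tmp/crush.mon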
Re: [ceph-users] RGW performance test, put 30 thousand objects into one bucket, average latency 3 seconds
It looks like you're just putting in data faster than your cluster can handle (in terms of IOPS). The first big hole (queue op_wq -> reached_pg) is the op sitting in a queue and waiting for processing. The second gap is two parallel blocks: 1) write_thread_in_journal_buffer -> journaled_completion_queued, which is again a queue while it's waiting to be written to disk, and 2) waiting for subops from [19,9] -> sub_op_commit_rec (x2), which is waiting for the replica OSDs to write the transaction to disk.

You might be able to tune it a little, but right now bucket indices live in one object, so every write has to touch the same set of OSDs (twice! to mark an object as putting, and put). 2*30000/360 ≈ 166 ops/sec on that one index object, which is probably past what those disks can do, and it artificially increases the latency. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com baijia...@126.com wrote: ... there are so many ops writing to one bucket object, so how can I reduce the latency?
Re: [ceph-users] Pools do not respond
The PG in question isn't being properly mapped to any OSDs. There's a good chance that those trees (with 3 OSDs in 2 hosts) aren't going to map well anyway, but the immediate problem should resolve itself if you change the "choose" to "chooseleaf" in your rules. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Jul 3, 2014 at 4:17 AM, Iban Cabrillo cabri...@ifca.unican.es wrote: ... Any idea what I am doing wrong?? Thanks in advance, I
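(Concretely, the fix is one word per rule. A sketch of the corrected 4x1GbFCnlSAS rule from the earlier message, followed by recompiling and injecting the map; map.txt/map.bin are placeholder file names:)

rule 4x1GbFCnlSAS {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take 4x1GbFCnlSAS
        step chooseleaf firstn 0 type host   # was: step choose firstn 0 type host
        step emit
}

$ crushtool -c map.txt -o map.bin
$ ceph osd setcrushmap -i map.bin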
Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)
On Wed, Jul 2, 2014 at 3:06 PM, Marc m...@shoowin.de wrote: Hi, I was wondering: having a cache pool in front of an RBD pool is all fine and dandy, but imagine you want to pull backups of all your VMs (or one of them, or multiple...). Going to the cache for all those reads isn't only pointless, it'll also potentially fill up the cache and possibly evict actually frequently used data. Which got me thinking... wouldn't it be nifty if there was a special way of doing specific backup reads where you'd bypass the cache, ensuring the dirty cache contents get written to the cold pool first? Or at least doing special reads where a cache miss won't actually cache the requested data?

Yeah, these are nifty features, but the cache coherency implications are a bit difficult. More options will come as we are able to develop and (more importantly, by far) validate them. -Greg

AFAIK the backup routine for an RBD-backed KVM usually involves creating a snapshot of the RBD and putting that into a backup storage/tape, all done via librbd/API. Maybe something like that even already exists? KR, Marc
Re: [ceph-users] why lock the whole OSD handler thread
On Thu, Jul 3, 2014 at 8:24 AM, baijia...@126.com baijia...@126.com wrote: When I look at the function OSD::OpWQ::_process, I find that the PG lock locks the whole function. So when I use multiple threads to write to the same object, must they be serialized from the OSD handler thread to the journal write thread?

It's serialized while processing the write, but that doesn't include the wait time for the data to be placed on disk: it is merely sequencing the op and feeding it into the journal queue. Writes have to be ordered, so that's not likely to change. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] RGW performance test, put 30 thousand objects into one bucket, average latency 3 seconds
I find that the function OSD::OpWQ::_process uses the PG lock to lock the whole function, so this means the OSD threads can't handle ops that write to the same object in parallel. By adding logging to ReplicatedPG::op_commit, I find that the PG lock sometimes takes a long time, but I don't know where the PG is being locked. Where is the PG locked for so long?

thanks

baijia...@126.com
Re: [ceph-users] RGW performance test, put 30 thousand objects into one bucket, average latency 3 seconds
I put the .rgw.buckets.index pool on SSD OSDs, so the bucket index object must be written to SSD, and disk utilization is under 50%. So I don't think the disks are the bottleneck.

baijia...@126.com

From: baijia...@126.com
Date: 2014-07-04 01:29
To: Gregory Farnum
CC: ceph-users
Subject: Re: Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

I find that OSD::OpWQ::_process takes the PG lock for the whole function, so OSD threads can't handle ops that write to the same object in parallel. By adding logging to ReplicatedPG::op_commit, I found that the PG lock sometimes costs a long time, but I don't know where the PG gets locked, or what holds the lock for so long.

thanks

baijia...@126.com

From: Gregory Farnum
Date: 2014-07-04 01:02
To: baijia...@126.com
CC: ceph-users
Subject: Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

It looks like you're just putting in data faster than your cluster can handle (in terms of IOPS). The first big hole (queue op_wq -> reached_pg) is the op sitting in a queue and waiting for processing. The second set of parallel blocks are 1) write_thread_in_journal_buffer -> journaled_completion_queued, which is again a queue while the op waits to be written to disk, and 2) waiting for subops from [19,9] -> sub_op_commit_received (x2), which is waiting for the replica OSDs to write the transaction to disk. You might be able to tune it a little, but right now bucket indices live in one object, so every write has to touch the same set of OSDs (twice! once to mark the object as being put, and once when the put completes). 2*30000/360 = 166 index ops per second, which is probably past what those disks can do, and is artificially increasing the latency.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com baijia...@126.com wrote:
[original message with the full dump_historic_ops trace quoted above -- trimmed]

baijia...@126.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
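For reference, a minimal sketch of how a trace like the one above can be captured; the socket path assumes the default Ceph layout, and osd.0 is a placeholder id, so adjust both to your deployment:

    # Dump the slowest recent ops from an OSD's admin socket -- this is
    # the output format quoted in the original message.
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

    # The same socket exposes currently in-flight ops, which helps catch
    # ops stuck between "queue op_wq" and "reached_pg" as they happen.
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight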
Re: [ceph-users] Pools do not respond
Hi Gregory, Thanks a lot. I am beginning to understand how ceph works. I added a couple of OSD servers, and balanced the disks between them.

[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
# id    weight  type name               up/down reweight
-7      16.2    root 4x1GbFCnlSAS
-9      5.4             host node02
7       2.7                     osd.7   up      1
8       2.7                     osd.8   up      1
-4      5.4             host node03
2       2.7                     osd.2   up      1
9       2.7                     osd.9   up      1
-3      5.4             host node04
1       2.7                     osd.1   up      1
10      2.7                     osd.10  up      1
-6      16.2    root 4x4GbFCnlSAS
-5      5.4             host node01
3       2.7                     osd.3   up      1
4       2.7                     osd.4   up      1
-8      5.4             host node02
5       2.7                     osd.5   up      1
6       2.7                     osd.6   up      1
-2      5.4             host node04
0       2.7                     osd.0   up      1
11      2.7                     osd.11  up      1
-1      32.4    root default
-2      5.4             host node04
0       2.7                     osd.0   up      1
11      2.7                     osd.11  up      1
-3      5.4             host node04
1       2.7                     osd.1   up      1
10      2.7                     osd.10  up      1
-4      5.4             host node03
2       2.7                     osd.2   up      1
9       2.7                     osd.9   up      1
-5      5.4             host node01
3       2.7                     osd.3   up      1
4       2.7                     osd.4   up      1
-8      5.4             host node02
5       2.7                     osd.5   up      1
6       2.7                     osd.6   up      1
-9      5.4             host node02
7       2.7                     osd.7   up      1
8       2.7                     osd.8   up      1

The idea is to have at least 4 servers and 3 disks (2.7 TB, SAN attached) per server per pool. Now I have to adjust the pg and pgp numbers and do some performance tests. PS: what is the difference between choose and chooseleaf?
Thanks a lot!

2014-07-03 19:06 GMT+02:00 Gregory Farnum g...@inktank.com:
The PG in question isn't being properly mapped to any OSDs. There's a good chance that those trees (with 3 OSDs in 2 hosts) aren't going to map well anyway, but the immediate problem should resolve itself if you change the choose to chooseleaf in your rules.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Jul 3, 2014 at 4:17 AM, Iban Cabrillo cabri...@ifca.unican.es wrote:
Hi folks, I am following the test installation step by step, and checking some configuration before trying to deploy a production cluster. Now I have a healthy cluster with 3 mons + 4 OSDs. I have created a pool containing all OSDs, and two more: one for two of the servers and the other for the other two. The general pool works fine (I can create images and mount them on remote machines). But the other two do not work (the commands rados put or rbd ls pool hang forever). This is the tree:

[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
# id    weight  type name               up/down reweight
-7      5.4     root 4x1GbFCnlSAS
-3      2.7             host node04
1       2.7                     osd.1   up      1
-4      2.7             host node03
2       2.7                     osd.2   up      1
-6      8.1     root 4x4GbFCnlSAS
-5      5.4             host node01
3       2.7                     osd.3   up      1
4       2.7                     osd.4   up      1
-2      2.7             host node04
0       2.7                     osd.0   up      1
-1      13.5    root default
-2      2.7             host node04
0       2.7                     osd.0   up      1
-3      2.7             host node04
1       2.7                     osd.1   up      1
-4      2.7             host node03
2       2.7                     osd.2   up      1
-5      5.4             host node01
3       2.7                     osd.3   up      1
4       2.7                     osd.4   up      1

And this is the crushmap:
...
root 4x4GbFCnlSAS {
        id -6           # do not change unnecessarily
        alg straw
        hash 0          # rjenkins1
        item node01 weight 5.400
        item node04 weight 2.700
}
root 4x1GbFCnlSAS {
        id -7           # do not change unnecessarily
        alg straw
        hash 0          # rjenkins1
        item node04 weight 2.700
        item node03 weight 2.700
}
# rules
rule 4x4GbFCnlSAS {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take 4x4GbFCnlSAS
        step choose firstn 0 type host
        step emit
}
rule 4x1GbFCnlSAS {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take 4x1GbFCnlSAS
        step choose firstn 0 type host
        step emit
}
...

I of course set the crush_rules:
sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1
But something seems wrong (4x4GbFCnlSAS.pool is a 512MB file):
sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object 4x4GbFCnlSAS.pool
!!HANGS forever!!
From the ceph client the same happens:
rbd ls cloud-4x1GbFCnlSAS
!!HANGS forever!!
[root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS 4x1GbFCnlSAS.object
osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' -> pg 3.114ae7a9 (3.29) -> up ([], p-1) acting ([], p-1)

Any idea what I am doing wrong?? Thanks in advance, Iban
Re: [ceph-users] Pools do not respond
On Thu, Jul 3, 2014 at 11:17 AM, Iban Cabrillo cabri...@ifca.unican.es wrote:
[osd tree and setup details quoted above -- trimmed]
PS: what is the difference between choose and chooseleaf?

choose instructs the system to choose N different buckets of the given type (where N is specified by the "firstn 0" block to be the replication level, but could be 1: "firstn 1", or replication - 1: "firstn -1"). Since you're saying "choose firstn 0 type host", that's what you're getting out, and then you're emitting those 3 (by default) hosts. But they aren't valid devices (OSDs), so it's not a valid mapping; you're supposed to then say "choose firstn 1 device" or similar.
chooseleaf instead tells the system to choose N different buckets, and then descend from each of those buckets to a leaf (device) in the CRUSH hierarchy. It's a little more robust against different mappings and failure conditions, so it's generally a better choice than choose if you don't need the finer granularity that choose provides.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
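A sketch of the fix Greg describes, applied to the 4x1GbFCnlSAS rule from the map above; the crushmap file names are placeholders chosen here for illustration:

    # The only change from the original rule: "choose" becomes "chooseleaf",
    # so CRUSH descends from each selected host bucket to an OSD and emits
    # devices instead of host buckets.
    rule 4x1GbFCnlSAS {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take 4x1GbFCnlSAS
            step chooseleaf firstn 0 type host
            step emit
    }

    # Recompile, inject, and sanity-check that the rule now maps PGs to OSDs:
    crushtool -c crushmap.txt -o crushmap.bin
    sudo ceph osd setcrushmap -i crushmap.bin
    crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --show-mappings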
Re: [ceph-users] Some OSD and MDS crash
Do those logs have a higher debugging level than the default? If not, never mind, as they will not have enough information. If they do, however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look.
-Joao

On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote:
On 03/07/2014 13:49, Joao Eduardo Luis wrote:
On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:
On 03/07/2014 00:55, Samuel Just wrote:
Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i -o /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked it for osdmaps, while the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in?

Which exact timeframe do you want? I have 7 days of logs, so I should have information about the upgrade from firefly to 0.82. Which mon's logs do you want? All three?
Regards

-Joao

Pierre: do you recall how and when that got set?

I am not sure I understand, but if I remember correctly, after the update to firefly I was in the state HEALTH_WARN crush map has legacy tunables, and I saw "feature set mismatch" in the logs. So, if I remember correctly, I ran "ceph osd crush tunables optimal" for the crush map problem, and I updated my client and server kernels to 3.16rc. Could it be that?
Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:
Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_4E62BB79__none
Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote:
The files.
When I upgraded:
ceph-deploy install --stable firefly servers...
on each server: service ceph restart mon
on each server: service ceph restart osd
on each server: service ceph restart mds
I upgraded from emperor to firefly. After repair, remap, replace, etc., I have some PGs which pass into the peering state. I thought: why not try version 0.82, it could solve my problem (that was my mistake). So I upgraded from firefly to 0.82 with:
ceph-deploy install --testing servers...
Now all the programs are at version 0.82. I have 3 mons, 36 OSDs and 3 MDSes.
Pierre
PS: I also find inc\uosdmap.13258__0_469271DE__none in each meta directory.

On 03/07/2014 00:10, Samuel Just wrote:
Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote:
Ok, in current/meta on osd 20 and osd 23, please attach all files matching ^osdmap.13258.* There should be one such file on each osd. (It should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory; you'll want to use find.) What version of ceph is running on your mons? How many mons do you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote:
Hi,
I did it; the log files are available here: https://blondeau.users.greyc.fr/cephlog/debug20/
The OSD's files are really big, +/- 80M.
After starting osd.20, some other OSDs crashed. I went from 31 OSDs up to only 16. I noticed that after this the number of down+peering PGs decreased from 367 to 248. Is that normal? Maybe it's temporary, for the time the cluster takes to verify all the PGs?
Regards
Pierre

On 02/07/2014 19:16, Samuel Just wrote:
You should add
debug osd = 20
debug filestore = 20
debug ms = 1
to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible.
Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote:
Yes, but how do I do that? With a command like this?
ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
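A sketch of the inspection workflow from this thread, collected in one place; it assumes the default OSD data path /var/lib/ceph/osd/ceph-<id> (the commands above used custom paths, so adjust paths and IDs to the deployment):

    # Locate each OSD's on-disk copy of osdmap epoch 13258; the file is
    # hashed into a subdirectory of current/meta, hence find:
    find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
    find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*'

    # Differing checksums between the two copies mean divergent maps:
    md5sum osd-20_osdmap.13258__0_4E62BB79__none osd-23_osdmap.13258__0_4E62BB79__none

    # Extract and decompile the embedded crush map from each copy, then
    # diff the decompiled text to see exactly what diverged:
    osdmaptool --export-crush /tmp/crush20 osd-20_osdmap.13258__0_4E62BB79__none
    crushtool -d /tmp/crush20 -o /tmp/crush20.d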
[ceph-users] mon: leveldb checksum mismatch
Hi list — I’ve got a small dev. cluster: 3 OSD nodes with 6 disks/OSDs each and a single monitor (this, it seems, was my mistake). The monitor node went down hard and it looks like the monitor’s db is in a funny state. Running ‘ceph-mon’ manually with ‘debug_mon 20’ and ‘debug_ms 20’ gave the following:

/usr/bin/ceph-mon -i monhost --mon-data /var/lib/ceph/mon/ceph-monhost --debug_mon 20 --debug_ms 20 -d
2014-07-03 23:20:55.800512 7f973918e7c0 0 ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73), process ceph-mon, pid 24930
Corruption: checksum mismatch
Corruption: checksum mismatch
2014-07-03 23:20:56.455797 7f973918e7c0 -1 failed to create new leveldb store

I attempted to make use of the leveldb Python library’s ‘RepairDB’ function, which just moves enough files into ‘lost’ that when running the monitor again I’m asked if I ran mkcephfs. Any insight into resolving these two checksum mismatches so I can access my OSD data would be greatly appreciated.

Thanks,
./JRH

p.s. I’m assuming that without the maps from the monitor, my OSD data is unrecoverable also.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mon: leveldb checksum mismatch
On 07/04/2014 12:29 AM, Jason Harley wrote:
[original message quoted above -- trimmed]

Hello Jason,

We don't have a way to repair leveldb. Having multiple monitors usually helps with such tricky situations.

According to this [1], the python bindings you're using may not be linked against snappy, which we were (mistakenly, until recently) using to compress data as it goes into leveldb. Not having those snappy bindings may be what's causing all those files to be moved to 'lost' instead.

The suggestion that the thread in [1] offers is to have the repair functionality directly in the 'application' itself. We could do this by adding a repair option to ceph-kvstore-tool, which could help. I'll be happy to get that into ceph-kvstore-tool tomorrow and push a branch for you to compile and test.

-Joao

[1] - https://groups.google.com/forum/#!topic/leveldb/YvszWNio2-Q

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mon: leveldb checksum mismatch
Hi Joao,

On Jul 3, 2014, at 7:57 PM, Joao Eduardo Luis joao.l...@inktank.com wrote:
We don't have a way to repair leveldb. Having multiple monitors usually helps with such tricky situations.

I know this, but for this small dev cluster I wasn’t thinking about corruption of my mon’s backing store. Silly me :)

According to this [1], the python bindings you're using may not be linked against snappy, which we were (mistakenly, until recently) using to compress data as it goes into leveldb. Not having those snappy bindings may be what's causing all those files to be moved to 'lost' instead.

I found the same posting, and confirmed that the ‘leveldb.so’ that ships with the ‘python-leveldb’ package on Ubuntu 13.10 links against ‘snappy’.

The suggestion that the thread in [1] offers is to have the repair functionality directly in the 'application' itself. We could do this by adding a repair option to ceph-kvstore-tool, which could help. I'll be happy to get that into ceph-kvstore-tool tomorrow and push a branch for you to compile and test.

I would be more than happy to try this out. Without fixing these checksums, I think I’m reinitializing my cluster. :\

Thank you,
./JRH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
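For anyone hitting this later: a minimal sketch of the repair attempt discussed in this thread, assuming snappy-linked python-leveldb bindings and the default mon store location (the host name "monhost" is taken from the log above). Back up the store first, since RepairDB can move unrecoverable tables into a 'lost' directory:

    # Back up the mon data directory before any repair attempt.
    cp -a /var/lib/ceph/mon/ceph-monhost /var/lib/ceph/mon/ceph-monhost.bak

    # Run leveldb's built-in repair through the python bindings; the
    # monitor's leveldb lives in the store.db subdirectory of --mon-data.
    python -c "import leveldb; leveldb.RepairDB('/var/lib/ceph/mon/ceph-monhost/store.db')"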