Re: [ceph-users] OSD Weights
Hi,

As far as I know, Ceph won't attempt to do any weight modifications on its own. If you use the default CRUSH map, every device gets a default weight of 1. However, this value can be modified while the cluster runs. Simply update the CRUSH map like so:

    # ceph osd crush reweight {name} {weight}

If you need more input, have a look at the documentation ;-)
http://ceph.com/docs/master/rados/operations/crush-map/?highlight=crush#adjust-an-osd-s-crush-weight

Cheers,
--
Regards,
Sébastien Han.

On Wed, Feb 13, 2013 at 4:23 PM, sheng qiu <herbert1984...@gmail.com> wrote:
> Hi Gregory,
>
> Once Ceph is running, will it change the weight dynamically (if it is not set properly), or can the weight only be changed by the user through the command line, or can it not be changed online at all?
>
> Thanks,
> Sheng
>
> On Mon, Feb 11, 2013 at 3:31 PM, Gregory Farnum <g...@inktank.com> wrote:
>> On Mon, Feb 11, 2013 at 12:43 PM, Holcombe, Christopher <cholc...@cscinfo.com> wrote:
>>> Hi Everyone,
>>>
>>> I just wanted to confirm my thoughts on the Ceph OSD weightings. My understanding is that they are a statistical distribution number. My current setup has 3TB hard drives, which all have the default weight of 1. I was thinking that if I mixed in 4TB hard drives in the future, Ceph would only put 3TB of data on them. I thought that if I changed the weight to 3 for the 3TB hard drives and 4 for the 4TB hard drives, it would correctly use the larger storage disks. Is that correct?
>>
>> Yep, looks good.
>> -Greg
>>
>> PS: This is a good question for the new ceph-users list
>> (http://ceph.com/community/introducing-ceph-users/) :)

--
Sheng Qiu
Texas A&M University
Room 332B Wisenbaker
email: herbert1984...@gmail.com
College Station, TX 77843-3259

___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
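[Editor's note] The convention Greg confirms above — a CRUSH weight proportional to drive capacity, e.g. 3 for a 3 TB drive and 4 for a 4 TB drive — is easy to script. A hedged sketch; the OSD names and sizes below are made up, and the printed commands are the ones you would then run by hand against a real cluster:

```python
def reweight_commands(osd_sizes_tb):
    """Return the 'ceph osd crush reweight' commands for each OSD,
    using capacity in TB as the weight (any proportional scale works)."""
    return ["ceph osd crush reweight %s %.1f" % (name, tb)
            for name, tb in sorted(osd_sizes_tb.items())]

# Hypothetical cluster: two 3 TB drives and one newly added 4 TB drive.
for cmd in reweight_commands({"osd.0": 3, "osd.1": 3, "osd.2": 4}):
    print(cmd)
# prints:
# ceph osd crush reweight osd.0 3.0
# ceph osd crush reweight osd.1 3.0
# ceph osd crush reweight osd.2 4.0
```

As the thread notes, rebalancing starts as soon as the weights change, so on a loaded cluster you may want to apply these one at a time.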
urgent journal conf on ceph.conf
Can someone please help me with the ceph.conf for 0.56.2? I have two storage servers, each with 3TB hard drives and two SSDs. I want to put the OSD data on a hard drive and the OSD journal on an SSD, and I want to know how the osd journal configuration is pointed at the SSD. My SSD is /dev/sdb. I have tried the osd data configuration "devs = /dev/sda" and it worked just fine. Is this line correct: "osd journal = /dev/osd$id/journal"? Or "osd journal = /dev/sdb"?

    [global]
        auth cluster required = cephx
        auth service required = cephx
        auth client required = cephx
        debug ms = 1

    [osd]
        osd journal size = 1000
        osd journal = /dev/osd$id/journal
        filestore xattr use omap = true
        osd mkfs type = xfs
        osd mkfs options xfs = -f
        osd mount options xfs = rw,noatime

    [osd.0]
        host = server04
        devs = /dev/sda
        osd journal = /dev/sdb

    [osd.1]
        host = server05
        devs = /dev/sda
        osd journal = /dev/sdb

THANKS.
Re: OSD dies after seconds
I upgraded to ceph 0.56.3 but the problem persists... the OSD starts, but after a second it finishes:

    2013-02-14 12:18:34.504391 7fae613ea760 10 journal _open journal is not a block device, NOT checking disk write cache on '/var/lib/ceph/osd/ceph-0/journal'
    2013-02-14 12:18:34.504400 7fae613ea760  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 17: 1048576 bytes, block size 4096 bytes, directio = 1, aio = 0
    2013-02-14 12:18:34.504458 7fae613ea760 10 journal journal_start
    2013-02-14 12:18:34.504506 7fae5d3c6700 10 journal write_thread_entry start
    2013-02-14 12:18:34.504515 7fae5d3c6700 20 journal write_thread_entry going to sleep
    2013-02-14 12:18:34.504706 7fae5cbc5700 10 journal write_finish_thread_entry enter
    2013-02-14 12:18:34.504716 7fae5cbc5700 20 journal write_finish_thread_entry sleeping
    2013-02-14 12:18:34.504893 7fae567fc700 20 filestore(/var/lib/ceph/osd/ceph-0) flusher_entry start
    2013-02-14 12:18:34.504903 7fae567fc700 20 filestore(/var/lib/ceph/osd/ceph-0) flusher_entry sleeping
    2013-02-14 12:18:34.505013 7fae613ea760  5 filestore(/var/lib/ceph/osd/ceph-0) umount /var/lib/ceph/osd/ceph-0
    2013-02-14 12:18:34.505036 7fae567fc700 20 filestore(/var/lib/ceph/osd/ceph-0) flusher_entry awoke
    2013-02-14 12:18:34.505044 7fae567fc700 20 filestore(/var/lib/ceph/osd/ceph-0) flusher_entry finish
    2013-02-14 12:18:34.505113 7fae5dbc7700 20 filestore(/var/lib/ceph/osd/ceph-0) sync_entry force_sync set
    2013-02-14 12:18:34.505129 7fae5dbc7700 10 journal commit_start max_applied_seq 2, open_ops 0
    2013-02-14 12:18:34.505136 7fae5dbc7700 10 journal commit_start blocked, all open_ops have completed
    2013-02-14 12:18:34.505138 7fae5dbc7700 10 journal commit_start nothing to do
    2013-02-14 12:18:34.505141 7fae5dbc7700 10 journal commit_start
    2013-02-14 12:18:34.505506 7fae613ea760 10 journal journal_stop
    2013-02-14 12:18:34.505698 7fae613ea760  1 journal close /var/lib/ceph/osd/ceph-0/journal
    2013-02-14 12:18:34.505787 7fae5d3c6700 20 journal write_thread_entry woke up
    2013-02-14 12:18:34.505796 7fae5d3c6700 10 journal write_thread_entry finish
    2013-02-14 12:18:34.505845 7fae5cbc5700 10 journal write_finish_thread_entry exit

On Wed, Feb 13, 2013 at 6:28 PM, Jesus Cuenca <jcue...@cnb.csic.es> wrote:
> Thanks for the fast answer. No, it does not segfault:
>
>     gdb --args /usr/local/bin/ceph-osd -i 0
>     ...
>     (gdb) run
>     Starting program: /usr/local/bin/ceph-osd -i 0
>     [Thread debugging using libthread_db enabled]
>     [New Thread 0x75fce700 (LWP 8920)]
>     starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
>     [Thread 0x75fce700 (LWP 8920) exited]
>     Program exited normally.
>
> On Wed, Feb 13, 2013 at 6:21 PM, Sage Weil <s...@inktank.com> wrote:
>> On Wed, 13 Feb 2013, Jesus Cuenca wrote:
>>> Hi, I'm setting up a small ceph 0.56.2 cluster on 3 64-bit Debian 6 servers with kernel 3.7.2.
>>
>> This might be http://tracker.ceph.com/issues/3595, which is a problem with google perftools (which we use by default); the version in squeeze is buggy. This doesn't seem to affect all squeeze users. Does it segfault?
>>
>> sage
>>
>>> My problem is that the OSDs die. First I try to start them with the init script:
>>>
>>>     /etc/init.d/ceph start osd.0
>>>     ...
>>>     starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
>>>
>>>     ps -ef | grep ceph
>>>     (No ceph-osd process)
>>>
>>> I then run with debugging:
>>>
>>>     ceph-osd -i 0 --debug_ms 20 --debug_osd 20 --debug_filestore 20 --debug_journal 20 -d
>>>     starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
>>>     2013-02-13 18:04:40.351830 7fe98cd8a760 10 -- :/0 rank.bind :/0
>>>     2013-02-13 18:04:40.351895 7fe98cd8a760 10 accepter.accepter.bind
>>>     2013-02-13 18:04:40.351910 7fe98cd8a760 10 accepter.accepter.bind bound on random port 0.0.0.0:6800/0
>>>     2013-02-13 18:04:40.351919 7fe98cd8a760 10 accepter.accepter.bind bound to 0.0.0.0:6800/0
>>>     2013-02-13 18:04:40.351930 7fe98cd8a760  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6800/8438 need_addr=1
>>>     2013-02-13 18:04:40.351935 7fe98cd8a760 10 -- :/0 rank.bind :/0
>>>     2013-02-13 18:04:40.351938 7fe98cd8a760 10 accepter.accepter.bind
>>>     2013-02-13 18:04:40.351943 7fe98cd8a760 10 accepter.accepter.bind bound on random port 0.0.0.0:6801/0
>>>     2013-02-13 18:04:40.351946 7fe98cd8a760 10 accepter.accepter.bind bound to 0.0.0.0:6801/0
>>>     2013-02-13 18:04:40.351952 7fe98cd8a760  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6801/8438 need_addr=1
>>>     2013-02-13 18:04:40.351959 7fe98cd8a760 10 -- :/0 rank.bind :/0
>>>     2013-02-13 18:04:40.351961 7fe98cd8a760 10 accepter.accepter.bind
>>>     2013-02-13 18:04:40.351966 7fe98cd8a760 10 accepter.accepter.bind bound on random port 0.0.0.0:6802/0
>>>     2013-02-13 18:04:40.351969 7fe98cd8a760 10 accepter.accepter.bind bound to 0.0.0.0:6802/0
>>>     2013-02-13 18:04:40.351975 7fe98cd8a760  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6802/8438 need_addr=1
>>>     2013-02-13 18:04:40.352636 7fe98cd8a760  5 filestore(/var/lib/ceph/osd/ceph-0) basedir
Re: urgent journal conf on ceph.conf
On 02/14/2013 11:24 AM, charles L wrote:
> Can someone please help me with the ceph.conf for 0.56.2? I have two storage servers, each with 3TB hard drives and two SSDs. I want to put the OSD data on a hard drive and the OSD journal on an SSD. My SSD is /dev/sdb. The osd data configuration "devs = /dev/sda" worked just fine. Is this line correct: "osd journal = /dev/osd$id/journal"? Or "osd journal = /dev/sdb"?

In the OSD-specific sections (osd.0 and osd.1) you override the journal setting made in the [osd] section. That [osd] setting is not needed, since you give the whole block device (/dev/sdb) to the OSD as a journal.

Are you sure /dev/sda is available for the OSD and is not your boot device?

Wido

> [global]
>     auth cluster required = cephx
>     auth service required = cephx
>     auth client required = cephx
>     debug ms = 1
>
> [osd]
>     osd journal size = 1000
>     osd journal = /dev/osd$id/journal
>     filestore xattr use omap = true
>     osd mkfs type = xfs
>     osd mkfs options xfs = -f
>     osd mount options xfs = rw,noatime
>
> [osd.0]
>     host = server04
>     devs = /dev/sda
>     osd journal = /dev/sdb
>
> [osd.1]
>     host = server05
>     devs = /dev/sda
>     osd journal = /dev/sdb
>
> THANKS.

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Re: urgent journal conf on ceph.conf
Including ceph-users, as it feels like this belongs there :-)

On 02/14/2013 01:47 PM, Wido den Hollander wrote:
> On 02/14/2013 11:24 AM, charles L wrote:
>> Can someone please help me with the ceph.conf for 0.56.2? I have two storage servers, each with 3TB hard drives and two SSDs. I want to put the OSD data on a hard drive and the OSD journal on an SSD. My SSD is /dev/sdb. The osd data configuration "devs = /dev/sda" worked just fine. Is this line correct: "osd journal = /dev/osd$id/journal"? Or "osd journal = /dev/sdb"?
>
> In the OSD-specific sections (osd.0 and osd.1) you override the journal setting made in the [osd] section. That [osd] setting is not needed, since you give the whole block device (/dev/sdb) to the OSD as a journal.
>
> Are you sure /dev/sda is available for the OSD and is not your boot device?
>
> Wido
>
>> [global]
>>     auth cluster required = cephx
>>     auth service required = cephx
>>     auth client required = cephx
>>     debug ms = 1
>>
>> [osd]
>>     osd journal size = 1000
>>     osd journal = /dev/osd$id/journal
>>     filestore xattr use omap = true
>>     osd mkfs type = xfs
>>     osd mkfs options xfs = -f
>>     osd mount options xfs = rw,noatime
>>
>> [osd.0]
>>     host = server04
>>     devs = /dev/sda
>>     osd journal = /dev/sdb
>>
>> [osd.1]
>>     host = server05
>>     devs = /dev/sda
>>     osd journal = /dev/sdb
>>
>> THANKS.
Re: [ceph-users] urgent journal conf on ceph.conf
+1 for Wido.

Moreover, if you want to store the journal on a block device, you should partition your journal disk and assign one partition per OSD, like /dev/sdb1, /dev/sdb2, /dev/sdb3.

Again, "osd journal = /dev/osd$id/journal" is wrong here: if you use this directive, it must point to a filesystem, because the journal will be a file. Anyway, as far as I'm concerned, I didn't notice that much performance gain from using the journal on a raw block device. In the end, just put the journal on a dedicated formatted partition, since the filesystem overhead is not that big. So keep the "osd journal = /dev/osd$id/journal" form, but change it to something like "osd journal = /srv/ceph/journals/osd$id/journal".

Cheers.
--
Regards,
Sébastien Han.

On Thu, Feb 14, 2013 at 2:52 PM, Joao Eduardo Luis <joao.l...@inktank.com> wrote:
> Including ceph-users, as it feels like this belongs there :-)
>
> On 02/14/2013 01:47 PM, Wido den Hollander wrote:
>> On 02/14/2013 11:24 AM, charles L wrote:
>>> Can someone please help me with the ceph.conf for 0.56.2? I have two storage servers, each with 3TB hard drives and two SSDs. I want to put the OSD data on a hard drive and the OSD journal on an SSD. My SSD is /dev/sdb. The osd data configuration "devs = /dev/sda" worked just fine. Is this line correct: "osd journal = /dev/osd$id/journal"? Or "osd journal = /dev/sdb"?
>>
>> In the OSD-specific sections (osd.0 and osd.1) you override the journal setting made in the [osd] section. That [osd] setting is not needed, since you give the whole block device (/dev/sdb) to the OSD as a journal.
>>
>> Are you sure /dev/sda is available for the OSD and is not your boot device?
>>
>> Wido
>>
>>> [global]
>>>     auth cluster required = cephx
>>>     auth service required = cephx
>>>     auth client required = cephx
>>>     debug ms = 1
>>>
>>> [osd]
>>>     osd journal size = 1000
>>>     osd journal = /dev/osd$id/journal
>>>     filestore xattr use omap = true
>>>     osd mkfs type = xfs
>>>     osd mkfs options xfs = -f
>>>     osd mount options xfs = rw,noatime
>>>
>>> [osd.0]
>>>     host = server04
>>>     devs = /dev/sda
>>>     osd journal = /dev/sdb
>>>
>>> [osd.1]
>>>     host = server05
>>>     devs = /dev/sda
>>>     osd journal = /dev/sdb
>>>
>>> THANKS.
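[Editor's note] Putting Wido's and Sébastien's advice together, a hedged sketch of what the corrected ceph.conf could look like. Device names and hosts are taken from the original post; the partition layout (/dev/sdb1 created beforehand on each server's SSD) is an assumption. With only one OSD per host, a single journal partition per SSD suffices; with more OSDs per host you would add /dev/sdb2, /dev/sdb3, and so on:

```ini
[osd]
    osd journal size = 1000
    filestore xattr use omap = true
    osd mkfs type = xfs
    osd mkfs options xfs = -f
    osd mount options xfs = rw,noatime

[osd.0]
    host = server04
    devs = /dev/sda
    ; first partition of this host's SSD, used as a raw block-device journal
    osd journal = /dev/sdb1

[osd.1]
    host = server05
    devs = /dev/sda
    osd journal = /dev/sdb1
```

Alternatively, following Sébastien's suggestion, format and mount one SSD partition and point each OSD at a journal file such as /srv/ceph/journals/osd$id/journal.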
radosgw: Update a key's meta data
Hi,

I was wondering how I could update a key's metadata, like the Content-Type. The solution on S3 seems to be to copy the key onto itself, replacing the metadata. If I do that in Ceph, will it work? And, more importantly, will it be done intelligently (i.e., without copying the actual file data around)?

I tried reading the code, and although part of it seems to hint at support for this (in rgw_rest_s3.cc), other parts don't seem to check at all whether src == dst (like rgw_op.cc).

Cheers,

Sylvain Munaut
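[Editor's note] For reference, the S3-side operation Sylvain describes is a copy of the object onto itself with the metadata directive set to REPLACE. A hedged sketch of how such a request could be built; the helper and the bucket/key names are made up, and the parameter names follow the S3 CopyObject API as exposed by, e.g., boto3's `copy_object`:

```python
def self_copy_request(bucket, key, new_content_type):
    """Build CopyObject parameters that rewrite an object's metadata in
    place: source and destination are the same bucket/key, and the
    REPLACE directive tells the server to take metadata from this
    request instead of copying it from the source object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key},
        "ContentType": new_content_type,
        "MetadataDirective": "REPLACE",
    }

# e.g. s3.copy_object(**self_copy_request("media", "img/photo.jpg", "image/jpeg"))
```

Whether radosgw performs this self-copy without moving the object data around is exactly the question raised above; the thread does not answer it.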
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
Hi List,

I can reproduce this assertion reliably; how can I help to debug it?

-martin

    (Lese Datenbank ... 52246 Dateien und Verzeichnisse sind derzeit installiert.)
    Vorbereitung zum Ersetzen von linux-firmware 1.79 (durch .../linux-firmware_1.79.1_all.deb) ...
    Ersatz für linux-firmware wird entpackt ...
    osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int)' thread 7f72b7fff700 time 2013-02-14 16:04:48.867285
    osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
    ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
     1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long, unsigned long, int)+0xd68) [0x7f72d4050848]
     2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f72d405742b]
     3: (Context::complete(int)+0xa) [0x7f72d400f9ba]
     4: (librbd::C_Request::finish(int)+0x85) [0x7f72d403f145]
     5: (Context::complete(int)+0xa) [0x7f72d400f9ba]
     6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f72d40241b7]
     7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f72d33db16d]
     8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f72d3444e50]
     9: (()+0x7e9a) [0x7f72d03c7e9a]
     10: (clone()+0x6d) [0x7f72d00f4cbd]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    terminate called after throwing an instance of 'ceph::FailedAssertion'
    Aborted
Re: osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
Hi Martin,

On Thu, 14 Feb 2013, Martin Mailand wrote:
> Hi List,
>
> I can reproduce this assertion reliably; how can I help to debug it?

Can you describe the workload? Are the OSDs also running 0.56.2(+)? Any other activity on the server side (data migration, OSD failure, etc.) that may have contributed?

We just reopened http://tracker.ceph.com/issues/2947 to track this. I'm working on reproducing it now as well.

Thanks!
sage

> -martin
>
> [apt output and ObjectCacher backtrace quoted in full, as in the original message]
Re: Simple doc update pull request
Merged. Thanks, Travis!

On Thu, 14 Feb 2013, Travis Rhoden wrote:
> Hey folks,
>
> I submitted a pull request for some simple doc updates related to cephx and creating new keys/clients. Please take a look when possible.
>
> https://github.com/ceph/ceph/pull/56
>
> - Travis
Re: Questions on some minor issues when upgrading from 0.48 to 0.56
Thanks Wido for the clarifications. I guess this means that I can update the OSD cluster to 0.56, and a client compiled against the old librados2 0.48 will still be able to access the cluster without any change. A client compiled against the new librados2 0.56 (API level 0.48) has to fix its rados_pool_list call before it can access the cluster.

DanielH

----- Original Message -----
From: Wido den Hollander <w...@42on.com>
To: Daniel Hoang <daniel_m_ho...@yahoo.com>
Cc: ceph-devel@vger.kernel.org
Sent: Wednesday, February 13, 2013 11:18 PM
Subject: Re: Questions on some minor issues when upgrading from 0.48 to 0.56

Hi,

On 02/13/2013 08:26 PM, Daniel Hoang wrote:
> Hi All,
>
> Just in case these issues have not been reported yet: I am on Ubuntu 12.04, upgraded librados2/librados-dev from 0.48 to 0.56, and I notice the following issues:
>
> 1. librados2/librados-dev still reports the minor version as 48. Should the minor version have changed to 56?

No. From what I understand, the librados version is only bumped when the API actually changes. This indicates that the API is still the same as in 0.48.

> 2. In 0.48, rados_pool_list(cluster, NULL, 0) could be used as a buffer-size query: it would return the buffer size required for the pool list string. In 0.56, this call now returns error -22 instead, and I have to pass in a tmp_buf[32] and len = 32 in order for the call to return successfully. I checked the current rados_pool_list API, and it does not mention that buffer and len must not be NULL and 0. Maybe this was a bug in 0.48, and we should not pass in NULL, 0?

Take a look at this commit:
https://github.com/ceph/ceph/commit/a677f47926b9640c53fbd00c94d6eb7a590a94fc

I ran into this with phprados as well:
https://github.com/ceph/phprados/commit/ee8b87fe93f87f92a7c3fa197a33b3d2de2fc4b6

Wido

> Thanks for your help,
> DanielH
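[Editor's note] For anyone hitting the same -22 (EINVAL): a hedged C sketch of the probe-then-fill pattern that works on both sides of the change, assuming rados_pool_list() returns the required buffer length when the supplied buffer is too small (as the thread describes). Error handling is kept minimal, and this is illustrative only since it needs a connected cluster handle:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>

/* List pool names: probe with a small real buffer (NULL/0 is rejected
 * with -EINVAL since 0.56), then retry with the size the call reports. */
int list_pools(rados_t cluster)
{
    char probe[32];
    int needed = rados_pool_list(cluster, probe, sizeof(probe));
    if (needed < 0)
        return needed;                    /* real error */

    char *buf = malloc(needed);
    if (!buf)
        return -1;
    int r = rados_pool_list(cluster, buf, needed);
    if (r >= 0) {
        /* buffer holds NUL-separated names, terminated by an empty string */
        for (const char *p = buf; *p; p += strlen(p) + 1)
            printf("pool: %s\n", p);
    }
    free(buf);
    return r;
}
```

The probe call may already succeed when the pool list fits in 32 bytes; the second call is then just a no-op re-read.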
Re: osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
Hi Sage,

everything is on 0.56.2 and the cluster is healthy. I can reproduce it with an "apt-get upgrade" within the VM; the VM OS is 12.04. Most of the time the assertion happens when the firmware .deb is updated; see the log in my first email. But I use a custom-built qemu version (1.4-rc1), which was built against 0.56.2.

    root@store1:~# ceph -s
       health HEALTH_OK
       monmap e1: 1 mons at {a=192.168.195.33:6789/0}, election epoch 1, quorum 0 a
       osdmap e160: 20 osds: 20 up, 20 in
        pgmap v28314: 3264 pgs: 3264 active+clean; 437 GB data, 1027 GB used, 144 TB / 145 TB avail
       mdsmap e1: 0/0/1 up

    root@store1:~# ceph --version
    ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)

    root@compute4:~# dpkg -l | grep 'rbd\|rados\|qemu'
    ii  librados2    0.56.2-1precise    RADOS distributed object store client library
    ii  librbd1      0.56.2-1precise    RADOS block device client library
    ii  qemu-common  1.4.0-rc1-vdsp1.0  qemu common functionality (bios, documentation, etc)
    ii  qemu-kvm     1.4.0-rc1-vdsp1.0  Full virtualization on i386 and amd64 hardware
    ii  qemu-utils   1.4.0-rc1-vdsp1.0  qemu utilities

-martin

On 14.02.2013 18:18, Sage Weil wrote:
> Hi Martin,
>
> Can you describe the workload? Are the OSDs also running 0.56.2(+)? Any other activity on the server side (data migration, OSD failure, etc.) that may have contributed?
>
> We just reopened http://tracker.ceph.com/issues/2947 to track this. I'm working on reproducing it now as well.
>
> Thanks!
> sage
Re: [ceph] Fix more performance issues found by cppcheck (#51)
Hey Danny,

I've merged in most of these (commit ffda2eab4695af79abdc9ed9bf001c3cd662a1f2), but had comments on a couple:

d99764e8c72a24eaba0542944f497cc2d9e154b4 is a patch on gtest. We did import that wholesale into our repository, as that's what they recommend, but I'd prefer to get patches by re-importing rather than by applying them to our tree. That patch should go upstream. :)

3fc14b3470748578840ed9374db53e9ef9926382 and 7ca6e5d8875d06aa61ce35b727ce7ee219838c69 are patches to remove the useless definition of a declared variable in cases like:

    bool success = false;
    ... // nothing that reads success ...
    success = function();

where the proposed fix is just doing:

    bool success;
    ... // nothing that reads success ...
    success = function();

However, for defensive-programming reasons, we'd prefer that variables be defined on declaration whenever possible. For these patches it's appropriate enough to just move the declaration to the first real definition (and I've done so), but in cases where that's not acceptable, we'd prefer to take whatever performance hit there is in order not to have random garbage on the stack. ;)

I did opt to leave the wireshark patch alone, because... C. But if you or Yehuda want to take another pass through that, I notice that it's still somewhat inconsistent about variable assignments on initialization.

-Greg

On Wed, Feb 13, 2013 at 12:47 PM, Danny Al-Gaaf <notificati...@github.com> wrote:
> Here are some more patches to fix performance issues found by cppcheck. This should now cover, together with wip-da-sca-cppcheck-performance, the following issues:
>
> - use empty() instead of size() to check for emptiness
> - don't pass string::c_str() to string arguments
> - prevent useless value assignment
> - pass some objects by reference instead of by value
>
> You can merge this Pull Request by running:
>
>     git pull https://github.com/dalgaaf/ceph wip-da-sca-cppcheck-performance-2
>
> Or view, comment on, or merge it at:
>
>     https://github.com/ceph/ceph/pull/51
>
> Commit Summary
>
> - CephxProtocol.h: pass CryptoKey by reference to decode_decrypt()
> - CInode.h: use !old_inodes.empty() instead of size()
> - AuthMonitor.cc: use !pending_auth.empty() instead of 'size() > 0'
> - OSDMonitor.h: use !reporters.empty() instead of size()
> - MonCaps.cc: use !empty() instead of size()
> - Monitor.cc: use empty() instead of size()
> - OSDMonitor.cc: use !empty() instead of size()
> - PGMonitor.cc: use !empty() instead of size() to check for emptiness
> - monmaptool.cc: use empty() instead of size() to check for emptiness
> - DBObjectMap.cc: use empty() instead of size() to check for emptiness
> - FileStore.cc: use empty() instead of size() to check for emptiness
> - HashIndex.cc: use empty() instead of size() to check for emptiness
> - LFNIndex.cc: use !holes.empty() instead of 'size() > 0'
> - OSD.cc: use empty() instead of size() to check for emptiness
> - PG.cc: use empty() instead of size() to check for emptiness
> - ReplicatedPG.cc: use empty() instead of size() to check for emptiness
> - ObjectCacher.cc: use empty() instead of !size() to check for emptiness
> - Objecter.cc: use !empty() instead of size() to check for emptiness
> - Objecter.cc: prevent useless value assignment
> - osdmaptool.cc: use empty() instead of 'size() < 1'
> - rados.cc: use omap.empty() instead of size() to check for emptiness
> - rbd.cc: use empty() instead of size() to check for emptiness
> - rgw/rgw_admin.cc: prevent useless value assignment
> - rgw/rgw_admin.cc: use empty() instead of size() to check for emptiness
> - rgw/rgw_gc.cc: use !empty() instead of size() to check for emptiness
> - rgw/rgw_log.cc: don't pass c_str() result to std::string argument
> - cls/rbd/cls_rbd.cc: use !empty() instead of 'size() > 0'
> - cls_refcount.cc: use empty() instead of !size() to check for emptiness
> - common/WorkQueue.cc: use !empty() instead of size() to check for emptiness
> - obj_bencher.cc: use empty() instead of 'size() == 0' to check for emptiness
> - crush/CrushWrapper.cc: don't pass c_str() result to std::string argument
> - crushtool.cc: use !empty() instead of 'size() > 0' to check for emptiness
> - use empty() instead of 'size() == 0' to check for emptiness
> - cls_kvs.cc: use !empty() instead of 'size() > 0' to check for emptiness
> - kv_flat_btree_async.cc: use empty() instead of size() to check for emptiness
> - librbd/internal.cc: use !empty() instead of size()
> - mds/CDir.cc: use !empty() instead of size()
> - mds/CInode.cc: use !empty() instead of size()
> - mds/Locker.cc: use !empty() instead of size()
> - mds/MDCache.cc: use empty() instead of size() to check for emptiness
> - mds/MDS.cc: use !empty() instead of size() to check for emptiness
> - mds/MDSMap.cc: use !empty() instead of size() to check for emptiness
> - mds/SnapServer.cc: use !empty() instead of size() to check for emptiness
> - mds/journal.cc: use !empty() instead of size() to check for emptiness
> - rgw/rgw_main.cc: use empty() instead of size() to check for emptiness
> - rgw/rgw_op.cc: use empty() instead of size() to check for emptiness
> - rgw/rgw_rados.cc:
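[Editor's note] The empty()-vs-size() commits above all follow one pattern; a minimal standalone illustration (the function name is made up). For standard containers, empty() states the intent directly and is guaranteed O(1), whereas on pre-C++11 std::list implementations size() could be O(n):

```cpp
#include <list>
#include <string>

// Before: `return pending.size() > 0;` -- cppcheck flags this because
// empty() expresses the intent and is never slower than computing size().
bool has_pending(const std::list<std::string>& pending) {
    return !pending.empty();
}
```

The same rewrite applies to any container type; only the !size() / size() == 0 / size() > 0 spelling varies across the commits listed above.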
Re: [ceph-commit] [ceph/ceph] e330b7: mon: create fail_mds_gid() helper; make 'ceph mds ...
On Thu, Feb 14, 2013 at 11:39 AM, GitHub <nore...@github.com> wrote:
> Branch: refs/heads/master
> Home: https://github.com/ceph/ceph
>
> Commit: e330b7ec54f89ca799ada376d5615e3c1dfc54f0
>   https://github.com/ceph/ceph/commit/e330b7ec54f89ca799ada376d5615e3c1dfc54f0
> Author: Sage Weil <s...@inktank.com>
> Date: 2013-01-17 (Thu, 17 Jan 2013)
> Changed paths:
>   M src/mon/MDSMonitor.cc
>   M src/mon/MDSMonitor.h
> Log Message:
>   mon: create fail_mds_gid() helper; make 'ceph mds rm ...' more generic
>
>   Take a gid or a rank or a name. Use a nicer helper.
>
>   Signed-off-by: Sage Weil <s...@inktank.com>
>
> Commit: 2e11297750a1b683c41f58c3fae05321fc49
>   https://github.com/ceph/ceph/commit/2e11297750a1b683c41f58c3fae05321fc49
> Author: Sage Weil <s...@inktank.com>
> Date: 2013-01-17 (Thu, 17 Jan 2013)
> Changed paths:
>   M src/common/config_opts.h
>   M src/mds/MDSMap.h
>   M src/mon/MDSMonitor.cc
> Log Message:
>   mon: enforce unique name in mdsmap
>
>   Add an 'mds enforce unique name' option, defaulting to true. If set, when an MDS boots it will kick any previous mds with the same name from the mdsmap. This is possibly less confusing for users: if an mds daemon restarts, it will immediately replace its previous instantiation. Two misconfigured daemons running with the same name will fight over the same role.
>
>   Fixes: #3857
>   Signed-off-by: Sage Weil <s...@inktank.com>
>
> Commit: dd7caf5f411696f8e7dc108270a8e85a34f3e80c
>   https://github.com/ceph/ceph/commit/dd7caf5f411696f8e7dc108270a8e85a34f3e80c
> Author: Sage Weil <s...@inktank.com>
> Date: 2013-01-17 (Thu, 17 Jan 2013)
> Changed paths:
>   M src/mds/MDS.cc
> Log Message:
>   mds: gracefully exit if newer gid replaces us by name
>
>   If 'mds enforce unique name' is set, and another MDS with the same name kicks us out of the MDSMap, gracefully exit instead of respawning and fighting over our position.
>
>   Signed-off-by: Sage Weil <s...@inktank.com>
>
> Commit: 6f28faf9e6613bff403bcd958818d8dccd004f9d
>   https://github.com/ceph/ceph/commit/6f28faf9e6613bff403bcd958818d8dccd004f9d
> Author: Sage Weil <s...@inktank.com>
> Date: 2013-01-17 (Thu, 17 Jan 2013)
> Changed paths:
>   M src/mds/MDCache.cc
> Log Message:
>   mds: open mydir after replay
>
>   In certain cases we may replay the journal and not end up with the dirfrag for mydir open. This is fine; we just need to open it up and fetch it below.
>
>   Signed-off-by: Sage Weil <s...@inktank.com>

In the tests I ran last night on this branch, I saw some Valgrind warnings in the OSDs and monitors, but I couldn't figure out any way for this series to have caused them, so I assume they're latent and pop up occasionally in master? In any case, please keep an eye out in case I was wrong. :)
-Greg
Re: [ceph-commit] [ceph/ceph] e330b7: mon: create fail_mds_gid() helper; make 'ceph mds ...
On Thu, 14 Feb 2013, Gregory Farnum wrote: In the tests I ran last night on this branch I saw some Valgrind warnings in the OSDs and Monitors, but I couldn't figure out any way for this series to have caused them so I assume they're latent and pop up occasionally in master? In any case, please keep an eye out in case I was wrong. :) That is probably a sneaky MMonPaxos leak in the mon. And the OSD isn't yet valgrind leak check clean. sage
Further thoughts on fsck for CephFS
Sage sent out an early draft of what we were thinking about doing for fsck on CephFS at the beginning of the week, but it was a bit incomplete and still very much a work in progress. I spent a good chunk of today thinking about it more so that we can start planning ticket-level chunks of work. The following is similar to where Sage's email ended up, but incorporates a bit more thought about memory scaling and is hopefully a bit more organized. :)

First, we are breaking up development and running of fsck into two distinct phases. The first phase will consist of a forward scrub, which simply starts with the root directory inode and follows links forward to check that it can find everything that's linked, and that the forward- and backward-links are consistent. (Backward links are under development right now; see http://tracker.ceph.com/issues/3540, or the CephFS backlog at http://tracker.ceph.com/rb/master_backlogs/cephfs, which is only groomed for the first several items on the list but might be of interest.) The intention for this phase is that it can be used both as part of a requested full-system fsck, and separately can be used to do background scrubbing during normal operation. I've tried to think through this forward scrub phase enough to do real development planning over the next couple of days, and have included my description below. Please comment if you see issues or have questions.

The second phase we're referring to as the backward scan. This mode is currently intended to be used as part of the fsck you would run after somehow losing data in RADOS, and is exclusively an offline operation (no client access to the data is permitted), and it involves scanning through every object in the CephFS metadata and data storage pools. We haven't thought this one through in quite as much detail, but I wanted to figure out a mechanism (that scales to large directories and hierarchies) enough to see how it might impact the design of our forward scrub.
I've got the details I came up with below, but this is a much more complicated problem and not one we need to start work on right away, so it doesn't go into nearly as much depth. Again though, please comment if you see any issues, have questions, or think there's something in the backward scan that impacts the forward scrub in a way I haven't accounted for! Thanks, Greg

MDS Forward Scrub

We maintain a stack of inodes to scrub. When a new scrub is requested, the inode in question goes into this stack at a position depending on how it's inserted. We have a separate scrubbing thread in every MDS. This thread begins in the scrub_node(inode) function, passing in the inode on the top of the scrub stack. scrub_node() starts by setting a new scrub_start_stamp and scrub_start_version on the inode (where the scrub_start_version is the version of the *parent* of the inode).

If the node is a file: the thread optionally spins off an async check of the backtrace (and in the future, optionally checks other metadata we might be able to add or pick up), then sleeps until finish_scrub(inode) is called. (If it doesn't do the backtrace check, it calls finish_scrub() directly.)

If the node is a dirfrag: put the dirfrag's first child on the top of the stack, and call scrub_node(child). Note that this might involve reading the dirfrag off disk, etc.

finish_scrub(inode) is pretty simple. If the inode is a dirfrag, it verifies that the parent's data matches the aggregate data of the children, then does the same as for a file:
1) Sets last_scrubbed_stamp to scrub_start_stamp, and last_scrubbed_version to scrub_start_version.
2) Pops the inode off of the scrub queue, and checks if the next thing up is the inode's parent.
3) If so, calls scrub_node() on the dentry following this one in the parent dirfrag.
3b) If there are no remaining nodes in the parent dirfrag, it checks that all the children were scrubbed following the parent's scrub_start_version (or modified — we don't want to scrub hierarchies that were renamed into the tree following a scrub start), then calls finish_scrub() on the dirfrag.

If at any point the scrub thread finishes scrubbing a node which does not start up another one immediately (implying that another scrub got injected into the middle of one that was already running), it looks at the node in question. If it's a file, it calls scrub_node() on it. If it's a dirfrag, it finds the first dentry in the dirfrag with a last_scrubbed_version less than the dirfrag's last_scrubbed_version, puts that dentry on the scrub_stack, and calls scrub_node() on that dentry.

This is simple enough in concept (although functionally it will need to be broken up quite a bit more in order to do all the locking in a reasonably efficient fashion). To expand this to a multi-MDS system, modify it slightly according to the following rules:
1) Only the authoritative
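To make the stack discipline above concrete, here is a toy sketch. This is not the planned MDS implementation (the real version is asynchronous and has to handle locking, on-disk dirfrag reads, and scrubs injected mid-run), and it models none of the stamp/version bookkeeping; it only shows the ordering property that files finish as they are visited while a dirfrag finishes only after all of its children, mirroring finish_scrub() doing the parent's aggregate check last:

```python
# Toy model of the forward-scrub stack described above. The "tree"
# dict stands in for dirfrags the real MDS would read off disk;
# scrub_node()/finish_scrub() collapse into one loop.

def forward_scrub(tree, root):
    """Return the order in which nodes finish scrubbing: files as
    they are visited, directories only after all their children."""
    order = []
    stack = [(root, iter(tree.get(root, [])))]
    while stack:
        node, children = stack[-1]
        child = next(children, None)
        if child is None:
            # No dentries left: the dirfrag itself can finish_scrub().
            stack.pop()
            order.append(node)
        elif child in tree:
            # Child is a dirfrag: descend before finishing the parent.
            stack.append((child, iter(tree[child])))
        else:
            # Child is a file: (async) backtrace check, then finish.
            order.append(child)
    return order

print(forward_scrub({"/": ["etc", "readme"], "etc": ["passwd"]}, "/"))
# ['passwd', 'etc', 'readme', '/']
```

Note how the root only appears at the end: that is the property the 3b) check above relies on when it compares children against the parent's scrub_start_version.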
Re: slow requests, hunting for new mon
On 2013-02-12, Chris Dunlop ch...@onthe.net.au wrote: Hi, What are likely causes for slow requests and monclient: hunting for new mon messages? E.g.: 2013-02-12 16:27:07.318943 7f9c0bc16700 0 monclient: hunting for new mon ... 2013-02-12 16:27:45.892314 7f9c13c26700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 30.383883 secs 2013-02-12 16:27:45.892323 7f9c13c26700 0 log [WRN] : slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached 2013-02-12 16:27:45.892328 7f9c13c26700 0 log [WRN] : slow request 30.383782 seconds old, received at 2013-02-12 16:27:15.508475: osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 987136~4096] 2.981cf6bc) v4 currently no flag points reached 2013-02-12 16:27:45.892334 7f9c13c26700 0 log [WRN] : slow request 30.383720 seconds old, received at 2013-02-12 16:27:15.508537: osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 1036288~8192] 2.981cf6bc) v4 currently no flag points reached 2013-02-12 16:27:45.892338 7f9c13c26700 0 log [WRN] : slow request 30.383684 seconds old, received at 2013-02-12 16:27:15.508573: osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 1454080~4096] 2.fff29a9a) v4 currently no flag points reached 2013-02-12 16:27:45.892341 7f9c13c26700 0 log [WRN] : slow request 30.328986 seconds old, received at 2013-02-12 16:27:15.563271: osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 1482752~4096] 2.fff29a9a) v4 currently no flag points reached OK, for the sake of anyone who might come across this thread when searching for similar issues... http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd ...unfortunately the error message in the link above says old request rather than slow request (old code?), so that page doesn't come up when googling for the slow request message. The page needs updating. 
The underlying problem in our case seems to have been spikes in the number of IOPS going to the disks (e.g. watch 'iostat -x' output). Whilst the disks were coping with steady state load, occasionally something (in this case, activity in a vm running on rbd) would cause a spike in activity and the disks couldn't cope. I'd initially looked at the amount of data going to the disks and thought it was well within the disks' capabilities, however I hadn't considered the IOPS. The (partial?) solution was to move the journals onto a separate device, halving the IOPS going to the data disk (write journal, write data) as well as avoiding having the heads slamming back and forth between the data and journal. We're continuing to watch the IOPS and will add more OSDs to spread the load further if necessary. I still don't know what the hunting messages actually indicate, but they've also disappeared since fixing the slow request messages. Incidentally, it strikes me that there is a significant amount of write amplification going on when running vms with a file system such as xfs or ext4 (with journal) on top of rbd/rados (with journal) on top of xfs (with journal). I.e. a single write from a vm can turn into up to 8 separate writes by the time it hits the underlying xfs filesystem. I think this is why our ceph setup is struggling at far less load on the same hardware compared to the drbd setup we're wanting to replace. Cheers! Chris.
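For anyone wanting to try the same journal split: the journal location is set per OSD in ceph.conf before the OSD is created. The section name, device path, and size below are placeholders for illustration, not a recommendation:

```ini
; Illustrative ceph.conf fragment: put osd.0's journal on a partition
; of a separate device. /dev/sdb1 and the size are example values only.
[osd.0]
    osd journal = /dev/sdb1
    osd journal size = 1024    ; in MB
```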
Re: slow requests, hunting for new mon
On Fri, 15 Feb 2013, Chris Dunlop wrote: On 2013-02-12, Chris Dunlop ch...@onthe.net.au wrote: Hi, What are likely causes for slow requests and monclient: hunting for new mon messages? E.g.: 2013-02-12 16:27:07.318943 7f9c0bc16700 0 monclient: hunting for new mon ... 2013-02-12 16:27:45.892314 7f9c13c26700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 30.383883 secs 2013-02-12 16:27:45.892323 7f9c13c26700 0 log [WRN] : slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached 2013-02-12 16:27:45.892328 7f9c13c26700 0 log [WRN] : slow request 30.383782 seconds old, received at 2013-02-12 16:27:15.508475: osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 987136~4096] 2.981cf6bc) v4 currently no flag points reached 2013-02-12 16:27:45.892334 7f9c13c26700 0 log [WRN] : slow request 30.383720 seconds old, received at 2013-02-12 16:27:15.508537: osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 1036288~8192] 2.981cf6bc) v4 currently no flag points reached 2013-02-12 16:27:45.892338 7f9c13c26700 0 log [WRN] : slow request 30.383684 seconds old, received at 2013-02-12 16:27:15.508573: osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 1454080~4096] 2.fff29a9a) v4 currently no flag points reached 2013-02-12 16:27:45.892341 7f9c13c26700 0 log [WRN] : slow request 30.328986 seconds old, received at 2013-02-12 16:27:15.563271: osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 1482752~4096] 2.fff29a9a) v4 currently no flag points reached OK, for the sake of anyone who might come across this thread when searching for similar issues... 
http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd ...unfortunately the error message in the link above says old request rather than slow request (old code?), so that page doesn't come up when googling for the slow request message. The page needs updating.

Updated, thanks!

The underlying problem in our case seems to have been spikes in the number of IOPS going to the disks (e.g. watch 'iostat -x' output). Whilst the disks were coping with steady state load, occasionally something (in this case, activity in a vm running on rbd) would cause a spike in activity and the disks couldn't cope. I'd initially looked at the amount of data going to the disks and thought it was well within the disks' capabilities, however I hadn't considered the IOPS. The (partial?) solution was to move the journals onto a separate device, halving the IOPS going to the data disk (write journal, write data) as well as avoiding having the heads slamming back and forth between the data and journal. We're continuing to watch the IOPS and will add more OSDs to spread the load further if necessary. I still don't know what the hunting messages actually indicate, but they've also disappeared since fixing the slow request messages.

This usually means the monitor wasn't responding and we (the OSD or client) are trying to reconnect (to a random monitor).

Incidentally, it strikes me that there is a significant amount of write amplification going on when running vms with a file system such as xfs or ext4 (with journal) on top of rbd/rados (with journal) on top of xfs (with journal). I.e. a single write from a vm can turn into up to 8 separate writes by the time it hits the underlying xfs filesystem. I think this is why our ceph setup is struggling at far less load on the same hardware compared to the drbd setup we're wanting to replace.

Currently, yes. There is always going to be some additional overhead because the object data is stored in a file system.
We were/are doing several other non-optimal things too, however, that is being improved in the current master branch (moving some metadata into leveldb which does a better job of managing the IO pattern). Stay tuned! sage
Re: [ceph-users] snapshot, clone and mount a VM-Image
On 02/14/2013 12:53 PM, Sage Weil wrote: Hi Jens- On Thu, 14 Feb 2013, Jens Kristian Søgaard wrote: Hi Sage, block device level. We plan to implement an incremental backup function for the relative change between two snapshots (or a snapshot and the head). It's O(n) the size of the device vs the number of files, but should be more efficient for all but the most sparse of images. The implementation should be simple; the challenge is mostly around the incremental file format, probably. That doesn't help you now, but would be a relatively self-contained piece of functionality for someone to contribute to RBD. This isn't a top

I'm very interested in having an incremental backup tool for Ceph, so if it is possible for me to do, I would like to take a shot at implementing it. It will be a spare time project, so I cannot say how fast it will progress though. If you have any details on how you would like to see the implementation work, please let me know!

Great to hear you're interested in this! There is a feature in the tracker open: http://tracker.ceph.com/issues/4084 (Not that there is much information there yet!) I think this breaks down into a few different pieces:

1) Decide what output format to use. We want to use something that resembles a portable, standard way of representing an incremental set of changes to a block device (or large file). I'm not sure what is out there, but we should look carefully before making up our own format.

2) Expose changed objects between rados snapshots. This is some generic functionality we would bake into librbd that would probably work similarly to how read_iterate() currently does (you specify a callback). We probably also want to provide this information directly to a user, so that they can get a dump of (offset, length) pairs for integration with their own tool. I expect this is just a core librbd method. It'd be nice to implement it as more than one request at once (unlike read_iterate()'s current implementation).
The interface could still be the same though.

3) Write a dumper based on #2 that outputs in the format from #1. The callback would (instead of printing file offsets) write the data to the output stream with appropriate metadata indicating which part of the image it is. Ideally the output part would be modular, too, so that we can come back later and implement support for new formats easily. The output data stream should be able to be directed at stdout or a file.

4) Write an importer for #1. It would take as input an existing image, assumed to be in the state of the reference snapshot, and write all the changed bits. Take input from stdin or a file.

I think it'd be good to have some kind of safety check here by default. Storing a checksum of the original snapshot with the backup and comparing to the image being restored onto would work, but would be pretty slow. Any ideas for better ways to do this?

5) If necessary, extend the above so that image resize events are properly handled.

Couldn't this be handled by storing the size of the original snapshot in the diff, and resizing to the size of the diff when restoring? Is there another issue you're thinking of?

Probably the trickiest bit here is #2, as it will probably involve adding some low-level rados operations to efficiently query the snapshot state from the client. With this (and any of the rest), we can help figure out how to integrate it cleanly. My suggestion is to start with #1, though (and make sure the rest of this all makes sense to everyone). Thanks! sage
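As a strawman for how #1, #3, and #4 might fit together end to end, here is a minimal sketch. The MAGIC string, struct layout, and helper names are all invented for illustration and are not a format proposal; a real dumper would be fed its extents by the librbd callback from #2 rather than a literal list:

```python
# Strawman diff stream: a magic header carrying the target image size
# (so a resize, point 5, round-trips), followed by (offset, length,
# data) records for each changed extent. Layout is illustrative only.
import io
import struct

MAGIC = b"rbd-diff-sketch"

def dump_diff(out, image_size, extents):
    out.write(MAGIC + struct.pack("<Q", image_size))
    for offset, data in extents:
        out.write(struct.pack("<QQ", offset, len(data)))
        out.write(data)

def apply_diff(inp, image):
    header = inp.read(len(MAGIC) + 8)
    assert header.startswith(MAGIC), "not a diff stream"
    size = struct.unpack("<Q", header[len(MAGIC):])[0]
    # Grow or shrink the base image to the recorded size (point 5).
    buf = bytearray(image[:size].ljust(size, b"\0"))
    while rec := inp.read(16):
        offset, length = struct.unpack("<QQ", rec)
        buf[offset:offset + length] = inp.read(length)
    return bytes(buf)

stream = io.BytesIO()
dump_diff(stream, 32, [(4, b"abcd"), (20, b"xy")])
restored = apply_diff(io.BytesIO(stream.getvalue()), b"\0" * 32)
```

The checksum safety check from #4 would slot in naturally as one more header field here, which is one argument for settling the format (#1) first.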
Mon losing touch with OSDs
G'day, In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the mons to lose touch with the osds? I imagine a network glitch could cause it, but I can't see any issues in any other system logs on any of the machines on the network. Having (mostly?) resolved my previous slow requests issue (http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around 13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1 5 seconds later: ceph-mon.b2.log: 2013-02-14 20:11:19.892060 7fa48d4f8700 0 log [INF] : pgmap v2822096: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail 2013-02-14 20:11:21.719513 7fa48d4f8700 0 log [INF] : pgmap v2822097: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail 2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago. marking down 2013-02-14 20:26:20.780244 7fa48d4f8700 1 mon.b2@0(leader).osd e769 e769: 2 osds: 1 up, 2 in 2013-02-14 20:26:20.837123 7fa48d4f8700 0 log [INF] : osdmap e769: 2 osds: 1 up, 2 in 2013-02-14 20:26:20.947523 7fa48d4f8700 0 log [INF] : pgmap v2822098: 576 pgs: 304 active+clean, 272 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail 2013-02-14 20:26:25.709341 7fa48dcf9700 -1 mon.b2@0(leader).osd e769 no osd or pg stats from osd.1 since 2013-02-14 20:11:21.523741, 904.185596 seconds ago. 
marking down 2013-02-14 20:26:25.822773 7fa48d4f8700 1 mon.b2@0(leader).osd e770 e770: 2 osds: 0 up, 2 in 2013-02-14 20:26:25.863493 7fa48d4f8700 0 log [INF] : osdmap e770: 2 osds: 0 up, 2 in 2013-02-14 20:26:25.954799 7fa48d4f8700 0 log [INF] : pgmap v2822099: 576 pgs: 576 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail 2013-02-14 20:31:30.772360 7fa48dcf9700 0 log [INF] : osd.1 out (down for 304.933403) 2013-02-14 20:31:30.893521 7fa48d4f8700 1 mon.b2@0(leader).osd e771 e771: 2 osds: 0 up, 1 in 2013-02-14 20:31:30.933439 7fa48d4f8700 0 log [INF] : osdmap e771: 2 osds: 0 up, 1 in 2013-02-14 20:31:31.055408 7fa48d4f8700 0 log [INF] : pgmap v2822100: 576 pgs: 576 stale+active+clean; 407 GB data, 417 GB used, 1444 GB / 1862 GB avail 2013-02-14 20:35:05.831221 7fa48dcf9700 0 log [INF] : osd.0 out (down for 525.033581) 2013-02-14 20:35:05.989724 7fa48d4f8700 1 mon.b2@0(leader).osd e772 e772: 2 osds: 0 up, 0 in 2013-02-14 20:35:06.031409 7fa48d4f8700 0 log [INF] : osdmap e772: 2 osds: 0 up, 0 in 2013-02-14 20:35:06.129046 7fa48d4f8700 0 log [INF] : pgmap v2822101: 576 pgs: 576 stale+active+clean; 407 GB data, 0 KB used, 0 KB / 0 KB avail The other 2 mons both have messages like this in their logs, starting at around 20:12: 2013-02-14 20:12:26.534977 7f2092b86700 0 -- 10.200.63.133:6789/0 10.200.63.133:6800/6466 pipe(0xade76500 sd=22 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1) 2013-02-14 20:13:24.741092 7f2092d88700 0 -- 10.200.63.133:6789/0 10.200.63.132:6800/2456 pipe(0x9f8b7180 sd=28 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1) 2013-02-14 20:13:56.551908 7f2090560700 0 -- 10.200.63.133:6789/0 10.200.63.133:6800/6466 pipe(0x9f8b6000 sd=41 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1) 2013-02-14 20:14:24.752356 7f209035e700 0 -- 10.200.63.133:6789/0 10.200.63.132:6800/2456 pipe(0x9f8b6500 sd=42 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1) (10.200.63.132 is mon.b4/osd.0, 10.200.63.133 is mon.b5/osd.1) ...although Greg Farnum indicates these messages are normal: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/5989/focus=5993 Osd.0 doesn't show any signs of distress at all: ceph-osd.0.log: 2013-02-14 20:00:10.280601 7ffceb012700 0 log [INF] : 2.7e scrub ok 2013-02-14 20:14:19.923490 7ffceb012700 0 log [INF] : 2.5b scrub ok 2013-02-14 20:14:50.571980 7ffceb012700 0 log [INF] : 2.7b scrub ok 2013-02-14 20:17:48.475129 7ffceb012700 0 log [INF] : 2.7d scrub ok 2013-02-14 20:28:22.601594 7ffceb012700 0 log [INF] : 2.91 scrub ok 2013-02-14 20:28:32.839278 7ffceb012700 0 log [INF] : 2.92 scrub ok 2013-02-14 20:28:46.992226 7ffceb012700 0 log [INF] : 2.93 scrub ok 2013-02-14 20:29:12.330668 7ffceb012700 0 log [INF] : 2.95 scrub ok ...although osd.1 started seeing problems around this time: ceph-osd.1.log: 2013-02-14 20:03:11.413352 7fd1d8f0a700 0 log [INF] : 2.23 scrub ok 2013-02-14 20:26:51.601425 7fd1e6f26700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 30.750063 secs 2013-02-14 20:26:51.601432 7fd1e6f26700 0 log [WRN] : slow request 30.750063 seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map 2013-02-14 20:26:51.601437 7fd1e6f26700 0 log [WRN] : slow request 30.749947 seconds old,
Re: Mon losing touch with OSDs
Hi Chris, On Fri, 15 Feb 2013, Chris Dunlop wrote: G'day, In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the mons to lose touch with the osds? I imagine a network glitch could cause it, but I can't see any issues in any other system logs on any of the machines on the network. Having (mostly?) resolved my previous slow requests issue (http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around 13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1 5 seconds later: ceph-mon.b2.log: 2013-02-14 20:11:19.892060 7fa48d4f8700 0 log [INF] : pgmap v2822096: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail 2013-02-14 20:11:21.719513 7fa48d4f8700 0 log [INF] : pgmap v2822097: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail 2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago. marking down There is a safety check that if the osd doesn't check in for a long period of time we assume it is dead. But it seems as though that shouldn't happen, since osd.0 has some PGs assigned and is scrubbing away. Can you enable 'debug ms = 1' on the mons and leave them that way, in the hopes that this happens again? It will give us more information to go on. 
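For reference, 'debug ms = 1' can be kept on across monitor restarts by adding it to ceph.conf on the mon hosts; the section below follows the usual convention (use a per-daemon section such as [mon.b2] to target a single monitor):

```ini
; Keep message-level debugging enabled on the monitors.
[mon]
    debug ms = 1
```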
...although osd.1 started seeing problems around this time: ceph-osd.1.log: 2013-02-14 20:03:11.413352 7fd1d8f0a700 0 log [INF] : 2.23 scrub ok 2013-02-14 20:26:51.601425 7fd1e6f26700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 30.750063 secs 2013-02-14 20:26:51.601432 7fd1e6f26700 0 log [WRN] : slow request 30.750063 seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map 2013-02-14 20:26:51.601437 7fd1e6f26700 0 log [WRN] : slow request 30.749947 seconds old, received at 2013-02-14 20:26:20.851420: osd_op(client.10001.0:618473 yy.rbd [watch 1~0] 2.3854277a) v4 currently wait for new map 2013-02-14 20:26:51.601440 7fd1e6f26700 0 log [WRN] : slow request 30.749938 seconds old, received at 2013-02-14 20:26:20.851429: osd_op(client.9998.0:39716 zz.rbd [watch 1~0] 2.71731007) v4 currently wait for new map 2013-02-14 20:26:51.601442 7fd1e6f26700 0 log [WRN] : slow request 30.749907 seconds old, received at 2013-02-14 20:26:20.851460: osd_op(client.10007.0:59572 aa.rbd [watch 1~0] 2.320eebb8) v4 currently wait for new map 2013-02-14 20:26:51.601445 7fd1e6f26700 0 log [WRN] : slow request 30.749630 seconds old, received at 2013-02-14 20:26:20.851737: osd_op(client.9980.0:86883 bb.rbd [watch 1~0] 2.ab9b579f) v4 currently wait for new map

Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened in any of the many previous slow request instances, and the timing doesn't look quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but the osd.0 log shows no problems at all, then the mon complains about not having heard from osd.1 since 20:11:21, whereas the first indication of trouble on osd.1 was the request from 20:26:20 not being processed in a timely fashion.

My guess is the above was a side-effect of osd.0 being marked out.
On 0.56.2 there is some strange peering workqueue laggyness that could potentially contribute as well. I recommend moving to 0.56.3.

Not knowing enough about how the various pieces of ceph talk to each other makes it difficult to distinguish cause and effect! Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did restarting the osds ('service ceph restart osd' on each osd host). The immediate issue was resolved by restarting ceph completely on one of the mon/osd hosts (service ceph restart). Possibly a restart of just the mon would have been sufficient.

Did you notice that the osds you restarted didn't immediately mark themselves in? Again, it could be explained by the peering wq issue, especially if there are pools in your cluster that are not getting any IO. sage