Re: [ceph-users] Help needed porting Ceph to RSockets
Hi Andreas, I think we're both working on the same thing, I've just changed the function calls over to rsockets in the source instead of using the pre-load library. That explains why we're having the exact same problem!

From what I've been able to tell, the entire problem revolves around rsockets not supporting POLLRDHUP. As far as I can tell the pipe will only be removed when tcp_read_wait returns -1. With rsockets it never receives the POLLRDHUP event after shutdown_socket() is called, so the rpoll call blocks until timeout (900 seconds) and the pipe stays active. The question then would be: how can we destroy a pipe without relying on POLLRDHUP? shutdown_socket() always gets called when the socket should be closed, so there might be a way to trick tcp_read_wait() into returning -1 by doing something in shutdown_socket(), but I'm not sure how to go about it. Any ideas?

On Mon, Aug 12, 2013 at 1:55 PM, Andreas Bluemle andreas.blue...@itxperts.de wrote:

Hi Matthew,

On Fri, 9 Aug 2013 09:11:07 +0200 Matthew Anderson manderson8...@gmail.com wrote:

So I've had a chance to re-visit this since Bécholey Alexandre was kind enough to let me know how to compile Ceph with the RDMACM library (thank you again!). At this stage it compiles and runs, but there appears to be a problem with calling rshutdown in Pipe: it seems to just wait forever for the pipe to close, which causes commands like 'ceph osd tree' to hang indefinitely after they work successfully. Debug MS is here - http://pastebin.com/WzMJNKZY

I am currently looking at a very similar problem. My test setup is to start ceph-mon monitors and check their state using ceph mon stat. The monitors (3 instances) and the ceph mon stat command are started with LD_PRELOAD=<path to librspreload.so>. The behaviour is that the ceph mon stat command connects, sends the request and receives the answer, which shows a healthy state for the monitors. But the ceph mon stat does not terminate. On the monitor end I encounter an EOPNOTSUPP being set at the time the connection shall terminate. This is detected in Pipe::tcp_read_wait(), where the socket is poll'ed for IN and HUP events. What I have found out already is that it is not the poll() / rpoll() which set the error: they do return a HUP event and are happy. As far as I can tell, the fact of the EOPNOTSUPP being set is historical at that point, i.e. it must have been set at some earlier stage. I am using ceph v0.61.7. Best Regards Andreas

I also tried RADOS bench but it appears to be doing something similar. Debug MS is here - http://pastebin.com/3aXbjzqS It seems like it's very close to working... I must be missing something small that's causing some grief. You can see the OSD coming up in the ceph monitor and the PGs all become active+clean.
When shutting down the monitor I get the below, which shows it waiting for the pipes to close -

2013-08-09 15:08:31.339394 7f4643cfd700 20 accepter.accepter closing
2013-08-09 15:08:31.382075 7f4643cfd700 10 accepter.accepter stopping
2013-08-09 15:08:31.382115 7f464bd397c0 20 -- 172.16.0.1:6789/0 wait: stopped accepter thread
2013-08-09 15:08:31.382127 7f464bd397c0 20 -- 172.16.0.1:6789/0 wait: stopping reaper thread
2013-08-09 15:08:31.382146 7f4645500700 10 -- 172.16.0.1:6789/0 reaper_entry done
2013-08-09 15:08:31.382182 7f464bd397c0 20 -- 172.16.0.1:6789/0 wait: stopped reaper thread
2013-08-09 15:08:31.382194 7f464bd397c0 10 -- 172.16.0.1:6789/0 wait: closing pipes
2013-08-09 15:08:31.382200 7f464bd397c0 10 -- 172.16.0.1:6789/0 reaper
2013-08-09 15:08:31.382205 7f464bd397c0 10 -- 172.16.0.1:6789/0 reaper done
2013-08-09 15:08:31.382210 7f464bd397c0 10 -- 172.16.0.1:6789/0 wait: waiting for pipes 0x3014c80,0x3015180,0x3015400 to close

The git repo has been updated if anyone has a few spare minutes to take a look - https://github.com/funkBuild/ceph-rsockets Thanks again -Matt

On Thu, Jun 20, 2013 at 5:09 PM, Matthew Anderson manderson8...@gmail.com wrote:

Hi All, I've had a few conversations on IRC about getting RDMA support into Ceph and thought I would give it a quick attempt to hopefully spur some interest. What I would like to accomplish is an RSockets-only implementation so I'm able to use Ceph, RBD and QEMU at full speed over an Infiniband fabric. What I've tried to do is port Pipe.cc and Accepter.cc to rsockets by replacing the regular socket calls with the rsocket equivalents. Unfortunately it doesn't compile and I get an error of -

CXXLD ceph-osd
./.libs/libglobal.a(libcommon_la-Accepter.o): In function `Accepter::stop()':
/home/matt/Desktop/ceph-0.61.3-rsockets/src/msg/Accepter.cc:243: undefined reference to `rshutdown'
/home/matt/Desktop/ceph-0.61.3-rsockets/src/msg/Accepter.cc:251: undefined reference to `rclose'
./.libs/libglobal.a(libcommon_la-Accepter.o): In function `Accepter::entry()':
[ceph-users] Start Stop OSD
I have 2 issues that I can not find a solution to.

First: I am unable to stop / start any osd by command. I have deployed with ceph-deploy on Ubuntu 13.04 and everything seems to be working fine. I have 5 hosts, 5 mons and 20 osds. Using initctl list | grep ceph gives me

ceph-mds-all-starter stop/waiting
ceph-mds-all start/running
ceph-osd-all start/running
ceph-osd-all-starter stop/waiting
ceph-all start/running
ceph-mon-all start/running
ceph-mon-all-starter stop/waiting
ceph-mon (ceph/cloud4) start/running, process 1841
ceph-create-keys stop/waiting
ceph-osd (ceph/15) start/running, process 2122
ceph-mds stop/waiting

However, OSDs 12, 13, 14 and 15 are all on this server. sudo stop ceph-osd id=12 gives me stop: Unknown instance: ceph/12. Does anyone know what is wrong? Nothing in the logs.

Also, when trying to put the journal on an SSD everything works fine. I can add all 4 disks per host to the same SSD. The issue is when I restart the server, only 1 out of the 3 OSDs will come back up. Has anyone else had this issue? Thanks! Josh

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
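[Editor's note: a first step here is checking which OSD instances upstart actually knows about on the box. A minimal sketch, assuming the default ceph-deploy layout on Ubuntu; the id 12 is from the report above:]

# Upstart derives its ceph-osd instances from the OSD data directories;
# check that one exists per OSD this host is supposed to carry
ls /var/lib/ceph/osd/
# expected: ceph-12  ceph-13  ceph-14  ceph-15

# List the instances upstart has actually registered
initctl list | grep ceph-osd

# Try starting the job explicitly; if the instance was never registered,
# running the daemon directly at least confirms the OSD itself is healthy
sudo start ceph-osd id=12 || sudo ceph-osd -i 12 -c /etc/ceph/ceph.conf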
[ceph-users] ceph rbd io tracking (rbdtop?)
Hi, The activity on our ceph cluster has gone up a lot. We are using exclusively RBD storage right now. Is there a tool/technique that could be used to find out which rbd images are receiving the most activity (something like rbdtop)? Thanks, Jeff -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph-deploy and journal on separate disk
Hi. I have some problems creating the journal on a separate disk, using the ceph-deploy osd prepare command. When I try to execute the following command: ceph-deploy osd prepare ceph001:sdaa:sda1 where: sdaa - disk for ceph data, sda1 - partition on an ssd drive for the journal, I get the following errors:

ceph@ceph-admin:~$ ceph-deploy osd prepare ceph001:sdaa:sda1
ceph-disk-prepare -- /dev/sdaa /dev/sda1 returned 1
Information: Moved requested sector from 34 to 2048 in order to align on 2048-sector boundaries. The operation has completed successfully.
meta-data=/dev/sdaa1   isize=2048  agcount=32, agsize=22892700 blks
         =             sectsz=512  attr=2, projid32bit=0
data     =             bsize=4096  blocks=732566385, imaxpct=5
         =             sunit=0     swidth=0 blks
naming   =version 2    bsize=4096  ascii-ci=0
log      =internal log bsize=4096  blocks=357698, version=2
         =             sectsz=512  sunit=0 blks, lazy-count=1
realtime =none         extsz=4096  blocks=0, rtextents=0
WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
mount: /dev/sdaa1: more filesystems detected. This should not happen, use -t <type> to explicitly specify the filesystem type or use wipefs(8) to clean up the device.
mount: you must specify the filesystem type
ceph-disk: Mounting filesystem failed: Command '['mount', '-o', 'noatime', '--', '/dev/sdaa1', '/var/lib/ceph/tmp/mnt.ek6mog']' returned non-zero exit status 32

Has anyone had a similar problem? Thanks for the help

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph rbd io tracking (rbdtop?)
On Mon, Aug 12, 2013 at 03:19:04PM +0200, Jeff Moskow wrote:

Hi, The activity on our ceph cluster has gone up a lot. We are using exclusively RBD storage right now. Is there a tool/technique that could be used to find out which rbd images are receiving the most activity (something like rbdtop)? Thanks, Jeff

I'm using nmon (http://nmon.sourceforge.net/pmwiki.php) on each OSD node in a small xterm, with 'd' for disk stats and '.' to show only the busy disks. -Dieter

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph rbd io tracking (rbdtop?)
On 08/12/2013 03:19 PM, Jeff Moskow wrote: Hi, The activity on our ceph cluster has gone up a lot. We are using exclusively RBD storage right now. Is there a tool/technique that could be used to find out which rbd images are receiving the most activity (something like rbdtop)? Are you using libvirt with KVM? If so, you can always poll all the disk I/O for all VMs using libvirt. Thanks, Jeff -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
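[Editor's note: for the libvirt route Wido suggests, a minimal polling sketch; virsh being installed and a guest disk target named 'vda' are assumptions here. Diff successive samples to find the busiest guests, and hence the busiest RBD images:]

# Poll block-I/O counters for every running VM via libvirt
for dom in $(virsh list --name); do
    echo "== $dom =="
    virsh domblkstat "$dom" vda    # rd_req rd_bytes wr_req wr_bytes
done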
[ceph-users] OSD Keep Crashing
Hi, It seems my OSD processes keep crashing randomly and I don't know why. It seems to happen when the cluster is trying to re-balance... In normal usage I didn't notice any crashes like that. We're running ceph 0.61.7 on an up-to-date ubuntu 12.04 (all packages including kernel are current). Anyone have an idea?

TRACE: ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
1: /usr/bin/ceph-osd() [0x79219a]
2: (()+0xfcb0) [0x7fd692da1cb0]
3: (gsignal()+0x35) [0x7fd69155a425]
4: (abort()+0x17b) [0x7fd69155db8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd691eac69d]
6: (()+0xb5846) [0x7fd691eaa846]
7: (()+0xb5873) [0x7fd691eaa873]
8: (()+0xb596e) [0x7fd691eaa96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x84303f]
10: (PG::RecoveryState::Recovered::Recovered(boost::statechart::statePG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0::my_context)+0x38f) [0x6d932f]
11: (boost::statechart::statePG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0::shallow_construct(boost::intrusive_ptrPG::RecoveryState::Active const, boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator)+0x5c) [0x6f270c]
12: (PG::RecoveryState::Recovering::react(PG::AllReplicasRecovered const)+0xb4) [0x6d9454]
13: (boost::statechart::simple_statePG::RecoveryState::Recovering, PG::RecoveryState::Active, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0::react_impl(boost::statechart::event_base const, void const*)+0xda) [0x6f296a]
14: (boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator::send_event(boost::statechart::event_base const)+0x5b) [0x6e320b]
15: (boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator::process_event(boost::statechart::event_base const)+0x11) [0x6e34e1]
16: (PG::handle_peering_event(std::tr1::shared_ptrPG::CephPeeringEvt, PG::RecoveryCtx*)+0x347) [0x69aaf7]
17: (OSD::process_peering_events(std::listPG*, std::allocatorPG* const, ThreadPool::TPHandle)+0x2f5) [0x632fc5]
18: (OSD::PeeringWQ::_process(std::listPG*, std::allocatorPG* const, ThreadPool::TPHandle)+0x12) [0x66e2d2]
19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x838476]
20: (ThreadPool::WorkThread::entry()+0x10) [0x83a2a0]
21: (()+0x7e9a) [0x7fd692d99e9a]
22: (clone()+0x6d) [0x7fd691617ccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-3 2013-08-12 15:58:15.561005 7fd683d78700 1 -- 10.136.48.18:6814/21240 == osd.56 10.136.48.14:0/17437 44 osd_ping(ping e8959 stamp 2013-08-12 15:58:15.556022) v2 47+0+0 (355096560 0 0) 0xc4e81c0 con 0x12fbeb00
-2 2013-08-12 15:58:15.561038 7fd683d78700 1 -- 10.136.48.18:6814/21240 -- 10.136.48.14:0/17437 -- osd_ping(ping_reply e8959 stamp 2013-08-12 15:58:15.556022) v2 -- ?+0 0x1683ec40 con 0x12fbeb00
-1 2013-08-12 15:58:15.568600 7fd67e56d700 1 -- 10.136.48.18:6813/21240 -- osd.44 10.136.48.15:6820/25671 -- osd_sub_op(osd.20.0:1293 25.328 699ac328/rbd_data.ae2732ae8944a.00240828/head//25 [push] v 8424'11 snapset=0=[]:[] snapc=0=[]) v7 -- ?+0 0x2df0f400
0 2013-08-12 15:58:15.581608 7fd681d74700 -1 *** Caught signal (Aborted) ** in thread 7fd681d74700
ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
1: /usr/bin/ceph-osd() [0x79219a]
2: (()+0xfcb0) [0x7fd692da1cb0]
3: (gsignal()+0x35) [0x7fd69155a425]
4: (abort()+0x17b) [0x7fd69155db8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd691eac69d]
6: (()+0xb5846) [0x7fd691eaa846]
7: (()+0xb5873) [0x7fd691eaa873]
8: (()+0xb596e)
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi Matthew, I am not quite sure about the POLLRDHUP. On the server side (ceph-mon), tcp_read_wait does see the POLLHUP - which should be the indicator that the other side is shutting down. I have also taken a brief look at the client side (ceph mon stat). It initiates a shutdown - but never finishes. See attached log file from ceph --log-file ceph-mon-stat.rsockets --debug-ms 30 mon stat. I have also attached the corresponding log file for regular TCP/IP sockets. It looks to me that in the rsockets case, the reaper is able to clean up even though there is still something left to do - and hence the shutdown never completes. Best Regards Andreas Bluemle

On Mon, 12 Aug 2013 15:11:47 +0800 Matthew Anderson manderson8...@gmail.com wrote: [Matthew's message and the earlier thread quoted in full above; trimmed]
[ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image
Hi All, Before going into the issue description, here is our hardware configuration:

- Physical machine * 3: each has quad-core CPU * 2, 64+ GB RAM, HDD * 12 (500GB ~ 1TB per drive; 1 for system, 11 for OSD). ceph OSDs are on physical machines.
- Each physical machine runs 5 virtual machines. One VM acts as a ceph MON (i.e. 3 MONs in total), the other 4 VMs provide either iSCSI or FTP/NFS service.
- Physical machines and virtual machines are based on the same software configuration: Ubuntu 12.04 + kernel 3.6.11, ceph v0.61.7.

The issues we met are:

1. Right after ceph installation, creating a pool then creating an image and mapping it is no problem. But if we do not use the whole environment for more than half a day, the same process (create pool - create image - map image) will return the error: no such file or directory (ENOENT). Once the issue occurs, it can easily be reproduced by the same process. But the issue may disappear if we wait 10+ minutes after pool creation. Rebooting the system also avoids it. I have success and failure straces logged on the same virtual machine (the one that provides FTP/NFS): success: https://www.dropbox.com/s/u8jc4umak24kr1y/rbd_done.txt failed: https://www.dropbox.com/s/ycuupmmrlc4d0ht/rbd_failed.txt

2. The second issue is to create two images (AAA and BBB) under one pool (xxx); if we run rbd map -p xxx --image AAA, the result is success but it shows BBB under /dev/rbd/xxx/. Using rbd showmapped, it shows AAA of pool xxx is mapped. I am not sure which one is really mapped because both images are empty. This issue is hard to reproduce, but once it happens /dev/rbd/ is messed up.

One more question, not about the rbd map issues: our usage is to map one rbd device and mount it in several places (in one virtual machine) for iSCSI, FTP and NFS; does that cause any problem for ceph operation?

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD Keep Crashing
Can you post more of the log? There should be a line towards the bottom indicating the line with the failed assert. Can you also attach ceph pg dump, ceph osd dump, ceph osd tree? -Sam

On Mon, Aug 12, 2013 at 11:54 AM, John Wilkins john.wilk...@inktank.com wrote:

Stephane, You should post any crash bugs with stack trace to ceph-devel ceph-de...@vger.kernel.org.

On Mon, Aug 12, 2013 at 9:02 AM, Stephane Boisvert stephane.boisv...@gameloft.com wrote: [original message and stack trace quoted in full above; trimmed]
Re: [ceph-users] ceph-deploy and journal on separate disk
Did you try using ceph-deploy disk zap ceph001:sdaa first? -Sam

On Mon, Aug 12, 2013 at 6:21 AM, Pavel Timoschenkov pa...@bayonetteas.onmicrosoft.com wrote: [original message quoted in full above; trimmed]

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
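[Editor's note: for completeness, the sequence that usually clears this up; it is destructive for anything on sdaa, and the device names are the ones from the original report:]

# Wipe stale partition tables and filesystem signatures so ceph-disk does not
# hit "more filesystems detected" on the fresh partition, then re-prepare
ceph-deploy disk zap ceph001:sdaa
ceph-deploy osd prepare ceph001:sdaa:sda1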
Re: [ceph-users] mounting a pool via fuse
Can you elaborate on what behavior you are looking for? -Sam On Fri, Aug 9, 2013 at 4:37 AM, Georg Höllrigl georg.hoellr...@xidras.com wrote: Hi, I'm using ceph 0.61.7. When using ceph-fuse, I couldn't find a way, to only mount one pool. Is there a way to mount a pool - or is it simply not supported? Kind Regards, Georg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Can you attach the output of ceph osd tree? Also, can you run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap? -Sam On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow j...@rtr.com wrote: Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, then restarting it, waiting 2 minutes and then doing the next one (all OSD's eventually restarted). I tried this twice. -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] run ceph without auth
I have referred you to someone more conversant with the details of mkcephfs, but for dev purposes, most of us use the vstart.sh script in src/ (http://ceph.com/docs/master/dev/). -Sam

On Fri, Aug 9, 2013 at 2:59 AM, Nulik Nol nulik...@gmail.com wrote:

Hi, I am configuring a single node for development purposes, but ceph asks me for a keyring. Here is what I do:

[root@localhost ~]# mkcephfs -c /usr/local/etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
preparing monmap in /tmp/foo/monmap
/usr/local/bin/monmaptool --create --clobber --add a 127.0.0.1:6789 --print /tmp/foo/monmap
/usr/local/bin/monmaptool: monmap file /tmp/foo/monmap
/usr/local/bin/monmaptool: generated fsid 7bd045a6-ca45-4f12-b9f3-e0c76718859a
epoch 0
fsid 7bd045a6-ca45-4f12-b9f3-e0c76718859a
last_changed 2013-08-09 04:51:06.921996
created 2013-08-09 04:51:06.921996
0: 127.0.0.1:6789/0 mon.a
/usr/local/bin/monmaptool: writing epoch 0 to /tmp/foo/monmap (1 monitors)
WARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please see: http://github.com/ceph/ceph-deploy
[root@localhost ~]# mkcephfs --init-local-daemons osd -d /tmp/foo
WARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please see: http://github.com/ceph/ceph-deploy
[root@localhost ~]# mkcephfs --init-local-daemons mds -d /tmp/foo
WARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please see: http://github.com/ceph/ceph-deploy
[root@localhost ~]# mkcephfs --prepare-mon -d /tmp/foo
Building generic osdmap from /tmp/foo/conf
/usr/local/bin/osdmaptool: osdmap file '/tmp/foo/osdmap'
/usr/local/bin/osdmaptool: writing epoch 1 to /tmp/foo/osdmap
Generating admin key at /tmp/foo/keyring.admin
creating /tmp/foo/keyring.admin
Building initial monitor keyring
cat: /tmp/foo/key.*: No such file or directory
WARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please see: http://github.com/ceph/ceph-deploy
[root@localhost ~]#

How can I tell ceph not to use a keyring? This is my config file:

[global]
auth cluster required = none
auth service required = none
auth client required = none
debug filestore = 20
[mon]
mon data = /data/mon
[mon.a]
host = s1
mon addr = 127.0.0.1:6789
[osd]
osd journal size = 1000
filestore_xattr_use_omap = true
[osd.0]
host = s1
osd data = /data/osd/osd1
osd mkfs type = bttr
osd journal = /data/journal/log
devs = /dev/loop0
[mds.a]
host = s1

TIA Nulik

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
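[Editor's note: if a dev cluster is all that's needed, a minimal sketch of the vstart.sh route Sam mentions, run from a built source tree; exact flags vary between releases, so check ./vstart.sh -h for your version:]

# Spin up a throwaway local cluster: -d debug output, -n new cluster,
# -l bind everything to localhost
cd src
MON=1 OSD=1 MDS=1 ./vstart.sh -d -n -l
./ceph -s          # query the dev cluster
./stop.sh          # tear it down when done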
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Sam, I've attached both files. Thanks! Jeff

On Mon, Aug 12, 2013 at 01:46:57PM -0700, Samuel Just wrote: Can you attach the output of ceph osd tree? Also, can you run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap? -Sam On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow j...@rtr.com wrote: Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, then restarting it, waiting 2 minutes and then doing the next one (all OSD's eventually restarted). I tried this twice. --

# id    weight  type name         up/down reweight
-1      14.61   root default
-3      14.61   rack unknownrack
-2      2.783   host ceph1
0       0.919   osd.0   up      1
1       0.932   osd.1   up      1
2       0.932   osd.2   up      0
-5      2.783   host ceph2
3       0.919   osd.3   down    0
4       0.932   osd.4   up      1
5       0.932   osd.5   up      1
-4      3.481   host ceph3
10      0.699   osd.10  up      1
6       0.685   osd.6   up      1
7       0.699   osd.7   up      1
8       0.699   osd.8   up      1
9       0.699   osd.9   up      1
-6      2.783   host ceph4
14      0.919   osd.14  down    0
15      0.932   osd.15  up      1
16      0.932   osd.16  down    0
-7      2.782   host ceph5
11      0.92    osd.11  up      0
12      0.931   osd.12  up      1
13      0.931   osd.13  up      1

osdmap (attachment: binary data)

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Backup monmap, osdmap, and crushmap
On 08/08/13 15:21, Craig Lewis wrote: I've seen a couple posts here about broken clusters that had to be repaired by modifying the monmap, osdmap, or the crush rules. The old school sysadmin in me says it would be a good idea to make backups of these 3 databases. So far though, it seems like everybody was able to repair their clusters by dumping the current map and modifying it. I'll probably do it, just to assuage my paranoia, but I was wondering what you guys thought.

Well, this could get you *some* info, but you wouldn't be able to reconstruct a monitor this way. There are just way too many maps that you'd need to reconstruct the monitor. The not-so-best approach would be to grab all map epochs, from 1 to the map's current epoch. We don't currently have a way to expose to the user what the first available map epoch in the store is (the need for it never came up), so for now you'd have to start at 1 and increment it until you found an existing version (we trim old versions, so that could be at 1, 10k, or a few hundred thousand, depending on how many maps you have). With all that information, you could somehow reconstruct a monitor with some effort -- and even so, we currently only expose an interface to obtain maps for some services such as the mon, osd, pg and mds; we have a bunch of other versions kept in the monitor that are not currently exposed to the user.

This is something we definitely want to improve on, but as of this moment the best approach to back up monitors reliably would be to stop the monitor, copy the store, and restart the monitor. Assuming you have 3+ monitors, stopping just one of them wouldn't affect the quorum or cluster availability. And assuming you're backing up a monitor that is in the quorum, then backing it up is as good as backing up any other monitor. Hope this helps. -Joao

I'm thinking of cronning this on the MON servers:

#!/usr/bin/env bash
# Number of days to keep backups
cleanup_age=10
# Fetch the current timestamp, to use in the backup filenames
date=$(date +%Y-%m-%dT%H:%M:%S)
# Dump the current maps
cd /var/lib/ceph/backups/
ceph mon getmap -o ./monmap.${date}
ceph osd getmap -o ./osdmap.${date}
ceph osd getcrushmap -o ./crushmap.${date}
# Delete old maps
find . -type f -regextype posix-extended -regex '\./(mon|osd|crush)map\..*' -mtime +${cleanup_age} -print0 | xargs -0 rm

-- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
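[Editor's note: Joao's stop-copy-restart advice as a minimal sketch; the mon id 'a', the default data path, and the backup destination are all assumptions to adjust for your deployment:]

# Back up one monitor's store while the other mons keep quorum
sudo service ceph stop mon.a
sudo rsync -a /var/lib/ceph/mon/ceph-a/ /backup/mon-a.$(date +%F)/
sudo service ceph start mon.a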
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Are you using any kernel clients? Will osds 3,14,16 be coming back? -Sam On Mon, Aug 12, 2013 at 2:26 PM, Jeff Moskow j...@rtr.com wrote: Sam, I've attached both files. Thanks! Jeff On Mon, Aug 12, 2013 at 01:46:57PM -0700, Samuel Just wrote: Can you attach the output of ceph osd tree? Also, can you run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap? -Sam On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow j...@rtr.com wrote: Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, then restarting it, waiting 2 minutes and then doing the next one (all OSD's eventually restarted). I tried this twice. -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to set Object Size/Stripe Width/Stripe Count?
I think the docs you are looking for are http://ceph.com/docs/master/man/8/cephfs/ (specifically the set_layout command). -Sam On Thu, Aug 8, 2013 at 7:48 AM, Da Chun ng...@qq.com wrote: Hi list, I saw the info about data striping in http://ceph.com/docs/master/architecture/#data-striping . But couldn't find the way to set these values. Could you please tell me how to that or give me a link? Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
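[Editor's note: a hedged sketch of what that looks like in practice; the path is an example and the option letters are from memory of the cephfs tool, so verify them against the man page linked above:]

# Set a 64 KB stripe unit, 4 stripes per object, 4 MB objects on a file
# that lives on a mounted cephfs, then confirm the layout took effect
touch /mnt/ceph/myfile
cephfs /mnt/ceph/myfile set_layout -u 65536 -c 4 -s 4194304
cephfs /mnt/ceph/myfile show_layout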
Re: [ceph-users] Start Stop OSD
On 08/12/2013 04:49 AM, Joshua Young wrote: I have 2 issues that I can not find a solution to. First: I am unable to stop / start any osd by command. I have deployed with ceph-deploy on Ubuntu 13.04 and everything seems to be working fine. I have 5 hosts, 5 mons and 20 osds. Using initctl list | grep ceph gives me ceph-osd (ceph/15) start/running, process 2122

The fact that only one is output means that upstart believes there's only one OSD job running. Are you sure the other daemons are actually alive and started by upstart?

However, OSDs 12, 13, 14 and 15 are all on this server. sudo stop ceph-osd id=12 gives me stop: Unknown instance: ceph/12. Does anyone know what is wrong? Nothing in logs. Also, when trying to put the journal on an SSD everything works fine. I can add all 4 disks per host to the same SSD. The issue is when I restart the server, only 1 out of the 3 OSDs will come back up. Has anyone else had this issue?

Are you using partitions on the SSD? If not, that's obviously going to be a problem; the device is usable by only one journal at a time.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
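[Editor's note: if the SSD is currently unpartitioned, something along these lines gives each OSD its own journal partition. This is a sketch: /dev/sdb and the 10 GB sizing are assumptions, and it destroys anything on the SSD:]

# One GPT partition per OSD journal on the shared SSD
sudo parted /dev/sdb mklabel gpt
for i in 1 2 3 4; do
    sudo parted /dev/sdb mkpart journal-$i $(( (i-1)*10 ))GB $(( i*10 ))GB
done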
Re: [ceph-users] Ceph pgs stuck unclean
Can you attach the output of:

ceph -s
ceph pg dump
ceph osd dump

and run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap? -Sam

On Wed, Aug 7, 2013 at 1:58 AM, Howarth, Chris chris.howa...@citi.com wrote:

Hi, One of our OSD disks failed on a cluster and I replaced it, but when it failed it did not completely recover and I have a number of pgs which are stuck unclean:

# ceph health detail
HEALTH_WARN 7 pgs stuck unclean
pg 3.5a is stuck unclean for 335339.172516, current state active, last acting [5,4]
pg 3.54 is stuck unclean for 335339.157608, current state active, last acting [15,7]
pg 3.55 is stuck unclean for 335339.167154, current state active, last acting [16,9]
pg 3.1c is stuck unclean for 335339.174150, current state active, last acting [8,16]
pg 3.a is stuck unclean for 335339.177001, current state active, last acting [0,8]
pg 3.4 is stuck unclean for 335339.165377, current state active, last acting [17,4]
pg 3.5 is stuck unclean for 335339.149507, current state active, last acting [2,6]

Does anyone know how to fix these? I tried the following, but this does not seem to work:

# ceph pg 3.5 mark_unfound_lost revert
pg has no unfound objects

thanks Chris

__ Chris Howarth OS Platforms Engineering Citi Architecture Technology Engineering (e) chris.howa...@citi.com (t) +44 (0) 20 7508 3848 (f) +44 (0) 20 7508 0964 (mail-drop) CGC-06-3A

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] could not generate the bootstrap key
Can you give a step-by-step account of what you did prior to the error? -Sam

On Tue, Aug 6, 2013 at 10:52 PM, 於秀珠 yuxiu...@jovaunn.com wrote:

Using ceph-deploy to manage an existing cluster, I follow the steps in the document, but there are some errors and I cannot gather the keys. When I run the command ceph-deploy gatherkeys PS-16, the logs show below:

2013-08-07 10:14:08,579 ceph_deploy.gatherkeys DEBUG Have ceph.client.admin.keyring
2013-08-07 10:14:08,579 ceph_deploy.gatherkeys DEBUG Checking PS-16 for /var/lib/ceph/mon/ceph-{hostname}/keyring
2013-08-07 10:14:08,674 ceph_deploy.gatherkeys DEBUG Got ceph.mon.keyring key from PS-16.
2013-08-07 10:14:08,674 ceph_deploy.gatherkeys DEBUG Checking PS-16 for /var/lib/ceph/bootstrap-osd/ceph.keyring
2013-08-07 10:14:08,774 ceph_deploy.gatherkeys WARNING Unable to find /var/lib/ceph/bootstrap-osd/ceph.keyring on ['PS-16']
2013-08-07 10:14:08,774 ceph_deploy.gatherkeys DEBUG Checking PS-16 for /var/lib/ceph/bootstrap-mds/ceph.keyring
2013-08-07 10:14:08,874 ceph_deploy.gatherkeys WARNING Unable to find /var/lib/ceph/bootstrap-mds/ceph.keyring on ['PS-16']

I also tried to deploy a new ceph cluster and met the same problem: when I create the mon and then gather the keys, I cannot gather the bootstrap keys either.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Sam, 3, 14 and 16 have been down for a while and I'll eventually replace those drives (I could do it now) but didn't want to introduce more variables. We are using RBD with Proxmox, so I think the answer about kernel clients is yes Jeff On Mon, Aug 12, 2013 at 02:41:11PM -0700, Samuel Just wrote: Are you using any kernel clients? Will osds 3,14,16 be coming back? -Sam On Mon, Aug 12, 2013 at 2:26 PM, Jeff Moskow j...@rtr.com wrote: Sam, I've attached both files. Thanks! Jeff On Mon, Aug 12, 2013 at 01:46:57PM -0700, Samuel Just wrote: Can you attach the output of ceph osd tree? Also, can you run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap? -Sam On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow j...@rtr.com wrote: Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, then restarting it, waiting 2 minutes and then doing the next one (all OSD's eventually restarted). I tried this twice. -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
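[Editor's note: once the decision is made that those OSDs are gone for good, removing them lets CRUSH stop trying to place PGs on them. The standard removal sequence, per OSD, shown for osd.3; hedged, so double-check against current docs before running:]

# Drop a dead OSD out of the cluster and out of the CRUSH map entirely
ceph osd out 3
ceph osd crush remove osd.3
ceph auth del osd.3
ceph osd rm 3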
Re: [ceph-users] Why is my mon store.db is 220GB?
Following a discussion we had today on #ceph, I've added some extra functionality to 'ceph-monstore-tool' to allow copying the data out of a store into a new mon store; it can be found on branch wip-monstore-copy. Use it as

ceph-monstore-tool --mon-store-path <mon-data-dir> --out <mon-data-out> --command store-copy

with mon-data-dir being the mon data dir where the current monitor lives (say, /var/lib/ceph/mon/ceph-a), and mon-data-out being another directory. This last directory should be empty, allowing the tool to create a new store, but if a store already exists it will not error out, copying instead the keys from the first store to the already existing store, so beware! Also, you should bear in mind that you must stop the monitor while doing this -- the tool won't work otherwise.

Anyway, this should allow you to grab all your data from the current monitor. You'll be presented with a few stats when the store finishes being copied, and hopefully you'll see that the tool didn't copy 220GB worth of data -- it should be considerably less! Let me know if this works out for you. -Joao

On 07/08/13 15:14, Jeppesen, Nelson wrote: Joao, Have you had a chance to look at my monitor issues? I ran 'ceph-mon -i FOO --compact' last week but it did not improve disk usage. Let me know if there's anything else I can dig up. The monitor is still at 0.67-rc2 with the OSDs at 0.61.7.

On 08/02/2013 12:15 AM, Jeppesen, Nelson wrote: Thanks for the reply, but how can I fix this without an outage? I tried adding 'mon compact on start = true' but the monitor just hung. Unfortunately this is a production cluster and can't take the outages (I'm assuming the cluster will fail without a monitor). I had three monitors; I was hit with the store.db bug and lost two of the three. I have tried running with 0.61.5, 0.61.7 and 0.67-rc2. None of them seem to shrink the DB.

My guess is that the compaction policies we are enforcing won't cover the portions of the store that haven't been compacted *prior* to the upgrade. Even today we still know of users with stores growing over dozens of GBs, requiring occasional restarts to compact (which is far from an acceptable fix). Some of these stores can take several minutes to compact when the monitors are restarted, although these guys can often mitigate any down time by restarting monitors one at a time while maintaining quorum. Unfortunately you don't have that luxury. :-\ If however you are willing to manually force a compaction, you should be able to do so with 'ceph-mon -i FOO --compact'. Now, there is a possibility this is why you've been unable to add other monitors to the cluster. Chances are that the iterators used to synchronize the store get stuck, or move slowly enough to make all sorts of funny timeouts be triggered. I intend to look into your issue (especially the problems with adding new monitors) in the morning to better assess what's happening. -Joao

-----Original Message----- From: Mike Dawson [mailto:mike.dawson at cloudapt.com] Sent: Thursday, August 01, 2013 4:10 PM To: Jeppesen, Nelson Cc: ceph-users at lists.ceph.com Subject: Re: [ceph-users] Why is my mon store.db is 220GB?

220GB is way, way too big. I suspect your monitors need to go through a successful leveldb compaction. The early releases of Cuttlefish suffered several issues with store.db growing unbounded. Most were fixed by 0.61.5, I believe. You may have luck stopping all Ceph daemons, then starting the monitor by itself.
When there were bugs, leveldb compaction tended to work better without OSD traffic hitting the monitors. Also, there are some settings to force a compact on startup, like 'mon compact on start = true' and 'mon compact on trim = true'. I don't think either is required anymore, though. See some history here: http://tracker.ceph.com/issues/4895 Thanks, Mike Dawson Co-Founder Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250

On 8/1/2013 6:52 PM, Jeppesen, Nelson wrote: My Mon store.db has been at 220GB for a few months now. Why is this and how can I fix it? I have one monitor in this cluster and I suspect that I can't add monitors to the cluster because it is too big. Thank you.

-- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
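[Editor's note: putting Joao's store-copy description above together, a concrete invocation might look like this; the paths and mon id are examples, and the monitor must be stopped first, as he notes:]

# Copy a (stopped) monitor's store into a fresh, compacted store
sudo service ceph stop mon.a
ceph-monstore-tool --mon-store-path /var/lib/ceph/mon/ceph-a \
                   --out /var/lib/ceph/mon/ceph-a.copy \
                   --command store-copy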
Re: [ceph-users] Backup monmap, osdmap, and crushmap
You saved me a bunch of time; I was planning to test my backup and restore later today. Thanks!

It occurred to me that the backups won't be as useful as I thought. I'd need to make sure that the PGs hadn't moved around after the backup was made. If they had, I'd spend a lot of time tracking down the new locations and manually rsyncing data. Not a big deal on a small cluster, but it gets harder as the cluster gets larger. Now that I look at the dumps, it looks like PG locations are one of the things missing. Binary backups of the MON directories are fine. All of the problems I've seen on the list occurred during cluster upgrades, so I'll make the backup part of my upgrade procedure instead of a cron.

If I wanted to restore a backup, what would be required? Looking at http://eu.ceph.com/docs/v0.47.1/ops/manage/grow/mon/#removing-a-monitor-from-an-unhealthy-or-down-cluster, I don't see a monmap directory inside /var/lib/ceph/mon/ceph-#/ anymore. I assume it went away during the switch to LevelDB, so I think I'll need to dump a copy when I make the binary backup. I'm assuming there is some node-specific data in each MON's store. If not, could I just stop all monitors, rsync the backup to all monitors, and start them all up? I'm not using cephx, but I assume the keyrings would complicate things. It would probably be easiest to make binary backups on all of the monitors (with some time delay, so only one is offline at a time), then start them up in newest-to-oldest-backup order. Or I could use LVM snapshots to make simultaneous backups on all monitors.

Thanks for the info. *Craig Lewis* Senior Systems Engineer Office +1.714.602.1309 Email cle...@centraldesktop.com *Central Desktop. Work together in ways you never thought possible.*

On 8/12/13 14:39, Joao Eduardo Luis wrote: [message quoted in full above; trimmed]
Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image
On 08/12/2013 10:19 AM, PJ wrote:

Hi All, Before going into the issue description, here is our hardware configuration: [hardware details quoted above; trimmed] 1. Right after ceph installation, creating a pool then creating an image and mapping it is no problem. But if we do not use the whole environment for more than half a day, the same process (create pool - create image - map image) will return the error: no such file or directory (ENOENT). Once the issue occurs, it can easily be reproduced by the same process. But the issue may disappear if we wait 10+ minutes after pool creation. Rebooting the system also avoids it.

This sounds similar to http://tracker.ceph.com/issues/5925 - and your case suggests it may be a monitor bug, since that test is userspace and you're using the kernel client. Could you reproduce this with logs from your monitors from the time of pool creation to after the map fails with ENOENT, and these log settings on all mons:

debug ms = 1
debug mon = 20
debug paxos = 10

If you could attach those logs to the bug or otherwise make them available that'd be great.

I have success and failure straces logged on the same virtual machine (the one that provides FTP/NFS): success: https://www.dropbox.com/s/u8jc4umak24kr1y/rbd_done.txt failed: https://www.dropbox.com/s/ycuupmmrlc4d0ht/rbd_failed.txt

Unfortunately these won't tell us much since the kernel is doing all the work with rbd map.

2. The second issue is to create two images (AAA and BBB) under one pool (xxx); if we run rbd map -p xxx --image AAA, the result is success but it shows BBB under /dev/rbd/xxx/. Using rbd showmapped, it shows AAA of pool xxx is mapped. I am not sure which one is really mapped because both images are empty. This issue is hard to reproduce, but once it happens /dev/rbd/ is messed up.

That sounds very strange, since 'rbd showmapped' and the udev rule that creates the /dev/rbd/pool/image symlinks use the same data source - /sys/bus/rbd/N/name. This sounds like a race condition where sysfs is being read (and reading stale memory) before the kernel finishes populating it. Could you file this in the tracker? Checking whether it still occurs in linux 3.10 would be great too. It doesn't seem possible with the current code.

One more question, not about the rbd map issues: our usage is to map one rbd device and mount it in several places (in one virtual machine) for iSCSI, FTP and NFS; does that cause any problem for ceph operation?

If it's read-only everywhere, it's fine, but otherwise you'll run into problems unless you've got something on top of rbd managing access to it, like ocfs2. You could use nfs on top of one rbd device, but having multiple nfs servers on top of the same rbd device won't work unless they can coordinate with each other. The same applies to iscsi and ftp. Josh

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
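[Editor's note: a quick way to apply the settings Josh asks for on each monitor node; the config path is the usual default, and the upstart job name is the one visible in the initctl output earlier in this digest:]

# Append the requested debug levels to the [mon] section, then restart
# the mons so the settings take effect
cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
    debug ms = 1
    debug mon = 20
    debug paxos = 10
EOF
sudo restart ceph-mon-all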
[ceph-users] Ceph instead of RAID
Hello community, I am currently installing some backup servers with 6x3TB drives in them. I played with RAID-10 but I was not impressed at all with how it performs during a recovery. Anyway, I thought: what if instead of RAID-10 I use ceph? All 6 disks will be local, so I could simply create 6 local OSDs + a monitor, right? Is there anything I need to watch out for in such a configuration?

Another thing. I am using ceph-deploy and I have noticed that when I do this: ceph-deploy --verbose new localhost the ceph.conf file is created in the current folder instead of /etc. Is this normal?

Also, in ceph.conf there's a line: mon host = ::1 Is this normal or do I need to change it to point to localhost?

Thanks for any feedback on this. Dmitry

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] pgs stuck unclean since forever, current state active+remapped
I've got PGs stuck for a long time and do not know how to fix it. Can someone help to check?

Environment: Debian 7 + ceph 0.61.7

root@ceph-admin:~# ceph -s
health HEALTH_WARN 6 pgs stuck unclean
monmap e2: 2 mons at {a=192.168.250.15:6789/0,b=192.168.250.8:6789/0}, election epoch 8, quorum 0,1 a,b
osdmap e159: 4 osds: 4 up, 4 in
pgmap v23487: 584 pgs: 578 active+clean, 6 active+remapped; 4513 MB data, 12658 MB used, 387 GB / 399 GB avail; 426B/s wr, 0op/s
mdsmap e114: 1/1/1 up {0=a=up:active}, 1 up:standby

root@ceph-admin:~# ceph health detail
HEALTH_WARN 6 pgs stuck unclean
pg 0.50 is stuck unclean since forever, current state active+remapped, last acting [3,1]
pg 1.4f is stuck unclean since forever, current state active+remapped, last acting [3,1]
pg 2.4e is stuck unclean since forever, current state active+remapped, last acting [3,1]
pg 1.8a is stuck unclean since forever, current state active+remapped, last acting [2,1]
pg 0.8b is stuck unclean since forever, current state active+remapped, last acting [2,1]
pg 2.89 is stuck unclean since forever, current state active+remapped, last acting [2,1]

root@ceph-admin:~# ceph osd tree
# id    weight  type name          up/down reweight
-1      4       root default
-3      2       rack unknownrack
-2      2       host ceph-admin
0       1       osd.0   up      1
1       1       osd.1   up      1
-4      1       host ceph-node02
2       1       osd.2   down    1
-5      1       host ceph-node01
3       1       osd.3   up      1

root@ceph-admin:~# ceph osd dump
epoch 159
fsid db32486a-7ad3-4afe-8b67-49ee2a6dcecf
created 2013-08-08 13:45:52.579015
modified 2013-08-12 05:18:37.895385
flags
pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 192 pgp_num 192 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 192 pgp_num 192 last_change 1 owner 0
pool 3 'volumes' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 owner 18446744073709551615
max_osd 5
osd.0 up in weight 1 up_from 138 up_thru 157 down_at 137 last_clean_interval [45,135) 192.168.250.15:6803/5735 192.168.250.15:6804/5735 192.168.250.15:6805/5735 exists,up 99f2aec0-2367-4b68-86f2-58d6d41589c6
osd.1 up in weight 1 up_from 140 up_thru 157 down_at 137 last_clean_interval [47,136) 192.168.250.15:6806/6882 192.168.250.15:6807/6882 192.168.250.15:6808/6882 exists,up d458ca35-ec55-47a9-a7ce-47b9ddf4d889
osd.2 up in weight 1 up_from 157 up_thru 158 down_at 135 last_clean_interval [48,134) 192.168.250.8:6800/3564 192.168.250.8:6801/3564 192.168.250.8:6802/3564 exists,up c4ee9f05-bd5f-4536-8cb8-0af82c00d3d6
osd.3 up in weight 1 up_from 143 up_thru 157 down_at 141 last_clean_interval [53,141) 192.168.250.16:6802/14618 192.168.250.16:6804/14618 192.168.250.16:6805/14618 exists,up e9d67b85-97d1-4635-95c8-f7c50cd7f6b1
pg_temp 0.50 [3,1]
pg_temp 0.8b [2,1]
pg_temp 1.4f [3,1]
pg_temp 1.8a [2,1]
pg_temp 2.4e [3,1]
pg_temp 2.89 [2,1]

root@ceph-admin:/etc/ceph# crushtool -d /tmp/crushmap
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host ceph-admin {
    id -2        # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0       # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}
rack unknownrack {
    id -3        # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0       # rjenkins1
    item ceph-admin weight 2.000
}
host ceph-node02 {
    id -4        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0       # rjenkins1
    item osd.2 weight 1.000
}
host ceph-node01 {
    id -5        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0       # rjenkins1
    item osd.3 weight 1.000
}
root default {
    id -1        # do not change unnecessarily
    # weight 4.000
    alg straw
    hash 0       # rjenkins1
    item unknownrack weight 2.000
    item ceph-node02 weight 1.000
    item ceph-node01 weight 1.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type osd
    step emit
}
rule volumes {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type osd
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size
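[Editor's note: the rules above select individual osds with no host separation, which can leave PGs remapped on a small cluster like this. If the intent is one replica per host, a hedged sketch of rewriting the rules and injecting the result back; the edit step is manual, and backing up the original map first is strongly advised:]

# Decompile, edit, recompile, and inject the CRUSH map
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# in /tmp/crushmap.txt change:  step choose firstn 0 type osd
#                          to:  step chooseleaf firstn 0 type host
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new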
Re: [ceph-users] Ceph instead of RAID
On 08/12/2013 06:49 PM, Dmitry Postrigan wrote: Hello community, I am currently installing some backup servers with 6x3TB drives in them. I played with RAID-10 but I was not impressed at all with how it performs during a recovery. Anyway, I thought what if instead of RAID-10 I use ceph? All 6 disks will be local, so I could simply create 6 local OSDs + a monitor, right? Is there anything I need to watch out for in such configuration? I mean, you can certainly do that. 1 mon and all OSDs on one server is not particularly fault-tolerant, perhaps, but if you have multiple such servers in the cluster, sure, why not? Another thing. I am using ceph-deploy and I have noticed that when I do this: ceph-deploy --verbose new localhost the ceph.conf file is created in the current folder instead of /etc. Is this normal? Yes. ceph-deploy also distributes ceph.conf where it needs to go. Also, in the ceph.conf there's a line: mon host = ::1 Is this normal or I need to change this to point to localhost? You want to configure the machines such that they have resolvable 'real' IP addresses: http://ceph.com/docs/master/start/quick-start-preflight/#hostname-resolution Thanks for any feedback on this. Dmitry ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Dan Mick, Filesystem Engineering Inktank Storage, Inc. http://inktank.com Ceph docs: http://ceph.com/docs ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
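[Editor's note: a quick sanity check along the lines Dan suggests before re-running ceph-deploy; a sketch where the hostname and address are examples:]

# The short hostname should resolve to a real LAN address, not loopback;
# otherwise 'mon host' ends up as ::1 or 127.0.0.1 in the generated ceph.conf
getent hosts "$(hostname -s)"     # expect e.g. 192.168.1.10, not ::1
ceph-deploy --verbose new "$(hostname -s)"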
Re: [ceph-users] Why is my mon store.db is 220GB?
Joao, (log file uploaded to http://pastebin.com/Ufrxn6fZ)

I had some good luck and some bad luck. I copied the store.db to a new monitor, injected a modified monmap and started it up (this is all on the same host). Very quickly it reached quorum (as far as I can tell) but didn't respond. Running 'ceph -w' just hung, no timeouts or errors. Same thing when restarting an OSD. The last lines of the log file ('...ms_verify_authorizer...') are from the 'ceph -w' attempts.

I restarted everything again and it sat there synchronizing. iostat reported about 100MB/s, but just reads. I let it sit there for 7 min but nothing happened.

Side question: how long can a ceph cluster run without a monitor? I was able to upload files via the rados gateway without issue even when the monitor was down.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com