[ceph-users] rbd snap ls: how much locking is involved?
Hi,

some of our applications (e.g., backy) use 'rbd snap ls' quite often. I see regular occurrences of blocked requests on a heavily loaded cluster which correspond to snap_list operations. Log file example:

2016-01-20 11:38:14.389325 osd.13 172.22.4.44:6803/13012 40529 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 15.098679 secs
2016-01-20 11:38:14.389336 osd.13 172.22.4.44:6803/13012 40530 : cluster [WRN] slow request 15.098679 seconds old, received at 2016-01-20 11:37:59.276665: osd_op(client.256532559.0:2041 rbd_data.c390a692ae8944a.057b@snapdir [list-snaps] 266.95976dde ack+read+known_if_redirected e807541) currently no flag points reached

Does anyone know if 'rbd snap ls' creates locks? On which level are these locks created (volume, pool, global)? Would it be best to reduce the usage of 'rbd snap ls' on a heavily loaded cluster?

TIA

Christian

--
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
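One way to reduce the frequency of list-snaps operations on a busy cluster is to cache the snapshot listing for a short time so that many concurrent backup jobs don't each hit the OSDs. This is only a sketch, not part of backy; the function name, cache path, and TTL are made up for illustration:

```shell
# Hypothetical helper: cache the stdout of an expensive command for $ttl
# seconds, so repeated calls within that window hit the cache instead of
# issuing a new list-snaps op to the cluster.
cached() {
    local ttl="$1" cache="$2"
    shift 2
    # reuse the cached output if it is younger than $ttl seconds
    if [ -f "$cache" ] && [ $(( $(date +%s) - $(stat -c %Y "$cache") )) -lt "$ttl" ]; then
        cat "$cache"
    else
        "$@" | tee "$cache"
    fi
}

# real usage would look like (pool/image names are examples):
#   cached 300 /var/cache/rbd-snaps/test-volume rbd snap ls rbd/test-volume
# self-contained demo with a stand-in command:
cached 300 /tmp/demo.cache echo "snapshot listing"
```

Whether a stale-by-minutes listing is acceptable of course depends on how the backup tool uses it.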
Re: [ceph-users] rbd snap ls: how much locking is involved?
On 21.01.2016 15:32, Jason Dillaman wrote:
> Are you performing a lot of 'rbd export-diff' or 'rbd diff' operations? I
> can't speak to whether or not list-snaps is related to your blocked
> requests, but I can say that operation is only issued when performing RBD
> diffs.

Yes, we are also doing 'rbd export-diff' on snapshots. So this could be the cause, too.

Regards

Christian
Re: [ceph-users] Blocked requests after "osd in"
On 10.12.2015 06:38, Robert LeBlanc wrote:
> Since I'm very interested in reducing this problem, I'm willing to try and
> submit a fix after I'm done with the new OP queue I'm working on. I don't
> know the best course of action at the moment, but I hope I can get some
> input for when I do try and tackle the problem next year.

Is there already a ticket present for this issue in the bug tracker? I think this is an important issue.

Regards

Christian
Re: [ceph-users] Blocked requests after "osd in"
On 10.12.2015 06:38, Robert LeBlanc wrote:
> I noticed this a while back and did some tracing. As soon as the PGs are
> read in by the OSD (very limited amount of housekeeping done), the OSD is
> set to the "in" state so that peering with other OSDs can happen and the
> recovery process can begin. The problem is that when the OSD is "in", the
> clients also see that and start sending requests to the OSDs before it has
> had a chance to actually get its bearings and is able to even service the
> requests. After discussion with some of the developers, there is no easy
> way around this other than let the PGs recover to other OSDs and then bring
> in the OSDs after recovery (a ton of data movement).

Many thanks for your detailed analysis. It's a bit disappointing that there seems to be no easy way around it. Any work to improve the situation is much appreciated. In the meantime, I'll be experimenting with pre-seeding the VFS cache to speed things up at least a little bit.

Regards

Christian
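For anyone curious what "pre-seeding the VFS cache" could look like in practice, here is a rough sketch. The directory layout (current/omap for the filestore's leveldb) matches our OSDs, but the helper name and paths are examples, not an established procedure:

```shell
# Warm the VFS caches of an OSD data directory before marking the OSD "in",
# so metadata lookups don't stall the first client requests.
preseed_osd() {
    local osd_dir="$1"
    # stat every entry: fills the dentry and inode caches without reading
    # any file data
    find "$osd_dir" -xdev -print0 | xargs -0 -r stat > /dev/null
    # read the omap (leveldb) files, which the OSD opens heavily on startup
    cat "$osd_dir"/current/omap/* > /dev/null 2>&1 || true
}

# intended usage (OSD id/path are examples):
#   preseed_osd /srv/ceph/osd/ceph-13 && ceph osd in 13
```

This only helps with the metadata/omap part of the startup stampede, not with the peering itself.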
[ceph-users] Blocked requests after "osd in"
Hi,

I'm getting blocked requests (>30s) every time an OSD is set to "in" in our clusters. Once this has happened, backfills run smoothly. I currently have no idea where to start debugging. Does anyone have a hint what to examine first in order to narrow down this issue?

TIA

Christian
Re: [ceph-users] Blocked requests after "osd in"
313: 1800 pgs: 277 active+remapped+wait_backfill, 881 active+remapped, 4 active+remapped+backfilling, 638 active+clean; 439 GB data, 906 GB used, 7700 GB / 8607 GB avail; 347 kB/s rd, 2551 kB/s wr, 261 op/s; 162079/313904 objects misplaced (51.633%); 218 MB/s, 54 objects/s recovering

I've used Brendan Gregg's opensnoop utility to see what is going on at the filesystem level (see attached log). AFAICS the OSD reads lots of directories. The underlying filesystem is XFS, so this should be sufficiently fast. During the time I see slow requests, the OSD continuously opens omap/*.ldb and omap/*.log files (starting at timestamp 95927.111837 in the opensnoop log, which corresponds to 15:06:37 wall-clock time). Any idea how to reduce the blockage at least?

> It's unclear to me whether MONs influence this somehow (the peering stage)
> but I have observed their CPU usage and IO also spikes when OSDs are
> started, so make sure they are not under load.

I don't think this is an issue here. Our MONs don't use more than 5% CPU during the operation and don't cause significant amounts of disk I/O.

Regards

Christian

Attachment: osd-opensnoop.log.gz
[ceph-users] nf_conntrack overflow crashes OSDs
Hi,

today I'd like to share a severe problem we've found (and fixed) on our Ceph cluster. We're running 48 OSDs (8 per host). While restarting all OSDs on a host, the kernel's nf_conntrack table overflowed. This rendered all OSDs on that machine unusable.

The symptoms were as follows. In the kernel log, we saw lines like:

| Aug 6 15:23:48 cartman06 kernel: [12713575.554784] nf_conntrack: table full, dropping packet

This is effectively a DoS against the kernel's IP stack. In the OSD log files, we saw repeated connection attempts like:

| 2014-08-06 15:22:35.348175 7f92f25a8700 10 -- 172.22.4.42:6802/9560 172.22.4.51:0/2025662 pipe(0x7f9208035440 sd=382 :6802 s=2 pgs=26750 cs=1 l=1 c=0x7f92080021c0).fault on lossy channel, failing
| 2014-08-06 15:22:35.348287 7f8fd69e4700 10 -- 172.22.4.42:6802/9560 172.22.4.39:0/3024957 pipe(0x7f9208007b30 sd=149 :6802 s=2 pgs=245725 cs=1 l=1 c=0x7f9208036630).fault on lossy channel, failing
| 2014-08-06 15:22:35.348293 7f8fe24e4700 20 -- 172.22.4.42:6802/9560 172.22.4.38:0/1013265 pipe(0x7f92080476e0 sd=450 :6802 s=4 pgs=32439 cs=1 l=1 c=0x7f9208018e90).writer finishing
| 2014-08-06 15:22:35.348284 7f8fd4fca700 2 -- 172.22.4.42:6802/9560 172.22.4.5:0/3032136 pipe(0x7f92080686b0 sd=305 :6802 s=2 pgs=306100 cs=1 l=1 c=0x7f920805f340).fault 0: Success
| 2014-08-06 15:22:35.348292 7f8fd108b700 20 -- 172.22.4.42:6802/9560 172.22.4.4:0/1000901 pipe(0x7f920802e7d0 sd=401 :6802 s=4 pgs=73173 cs=1 l=1 c=0x7f920802eda0).writer finishing
| 2014-08-06 15:22:35.344719 7f8fd1d98700 2 -- 172.22.4.42:6802/9560 172.22.4.49:0/3026524 pipe(0x7f9208033a80 sd=492 :6802 s=2 pgs=12845 cs=1 l=1 c=0x7f9208033ce0).reader couldn't read tag, Success

and so on, generating thousands of log lines. The OSDs were spinning at 100% CPU, trying to re-connect in rapid succession. The repeated connection attempts stopped nf_conntrack from getting out of its overflowed state.
Thus, we saw blocked requests for 15 minutes or so, until the MONs banned the stuck OSDs from the cluster. As a short-term countermeasure, we stopped all OSDs on the affected hosts and started them one by one, leaving enough time in between to allow the recovery to settle a bit (a 10 sec gap between OSDs was enough). During normal operation, we see only 5000-6000 connections on a host.

As a permanent fix, we have doubled the size of the nf_conntrack table and reduced some timeouts according to http://www.pc-freak.net/blog/resolving-nf_conntrack-table-full-dropping-packet-flood-message-in-dmesg-linux-kernel-log/. Now a restart of all 8 OSDs on a host works without problems.

Alternatively, we have considered removing nf_conntrack completely. This, however, is not possible since we use host-based firewalling and nf_conntrack is wired quite deeply into the Linux firewall code.

Just sharing our experience in case someone runs into the same problem.

Regards

Christian

--
Dipl.-Inf. Christian Kauhaus · k...@gocept.com · systems administration
gocept gmbh co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
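For reference, the kind of tuning described above ("double the table size, reduce some timeouts") could look roughly like this. The concrete numbers are illustrative, not the exact values from our hosts; size the table to your own peak connection count, which during a mass OSD restart is far above the usual 5000-6000 connections:

```shell
# enlarge the conntrack table (default is often 65536 or less)
sysctl -w net.netfilter.nf_conntrack_max=262144
# the conntrack hash table is commonly sized at about nf_conntrack_max / 4
echo 65536 > /sys/module/nf_conntrack/parameters/hashsize
# expire idle tracked connections faster so the table drains sooner
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
```

These settings need to go into /etc/sysctl.conf (and a modprobe option for the hash size) to survive a reboot.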
Re: [ceph-users] nf_conntrack overflow crashes OSDs
On 08.08.2014 14:05, Robert van Leeuwen wrote:
> It is also possible to specifically not conntrack certain connections,
> e.g.:
> iptables -t raw -A PREROUTING -p tcp --dport 6789 -j CT --notrack

Thanks Robert. This is really an interesting approach. We will test it.

Regards

Christian
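Extending Robert's single rule to all Ceph traffic might look like this sketch: exempt both the monitor port and the OSD port range from connection tracking, for incoming and locally generated packets. The 6800:7300 range is the commonly used OSD port range; adjust it to your ms_bind settings:

```shell
# monitor port
iptables -t raw -A PREROUTING -p tcp --dport 6789 -j CT --notrack
iptables -t raw -A OUTPUT     -p tcp --dport 6789 -j CT --notrack
# OSD port range (range is an assumption; check your cluster's config)
iptables -t raw -A PREROUTING -p tcp --dport 6800:7300 -j CT --notrack
iptables -t raw -A OUTPUT     -p tcp --dport 6800:7300 -j CT --notrack
```

One caveat: untracked packets no longer match `-m state --state ESTABLISHED` rules in the filter table, so the filter chains need explicit ACCEPT rules for these ports.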
Re: [ceph-users] Ceph RBD and Backup.
On 03.07.2014 07:21, Irek Fasikhov wrote:
> Dear community. How do you make backups of Ceph RBD?

We at gocept are currently in the process of developing backy, a new-style backup tool that works directly with block-level snapshots / diffs. The tool is not quite finished, but it is making rapid progress. It would be great if you'd try it, spot bugs, contribute code etc. Help is appreciated. :-)

PyPI page: https://pypi.python.org/pypi/backy/
Pull requests go here: https://bitbucket.org/ctheune/backy

Christian Theune c...@gocept.com is the primary contact.

HTH

Christian
Re: [ceph-users] How to improve performance of ceph objcect storage cluster
On 26.06.2014 20:05, Aronesty, Erik wrote:
> Well, it's the same for rbd, what's your stripe count set to? For a small
> system, it should be at least the # of nodes in your system. As systems get
> larger, there are limited returns... I would imagine there would be some
> OSD caching advantage to keeping the number limited (i.e.: more requests of
> the same device = more likely the device has the next stripe unit
> prefetched).

I'm trying to make sure I understand this: usually you can't set the stripe count directly, but you can set the default stripe size of RBD volumes. So in consequence, does this mean to go with a larger RBD object size than the default (4 MiB)?

Regards

Christian
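To make the object-size knob concrete: RBD expresses the object size as an "order", where object_size = 2^order bytes, and the order can be set at image creation time. A quick sketch (pool/image names are examples):

```shell
# RBD "order" sets the object size: object_size = 2^order bytes.
# order 22 -> 4 MiB (the default), order 23 -> 8 MiB, order 25 -> 32 MiB.
for order in 22 23 25; do
    echo "order $order -> $((1 << order)) bytes"
done

# creating a 100 GiB image with 8 MiB objects would look like:
#   rbd create --size 102400 --order 23 rbd/test-volume
```

Larger objects mean fewer objects per image (and thus fewer distinct OSD targets per sequential run), at the cost of coarser-grained recovery and snapshots.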
Re: [ceph-users] Behaviour of ceph pg repair on different replication levels
On 26.06.2014 02:08, Gregory Farnum wrote:
> It's a good idea, and in fact there was a discussion yesterday during the
> Ceph Developer Summit about making scrub repair significantly more
> powerful; they're keeping that use case in mind in addition to very
> fine-grained ones like specifying a particular replica for every object.

+1

This would be very cool.

> Yeah, it's got nothing and is relying on the local filesystem to barf if
> that happens. Unfortunately, neither xfs nor ext4 provide that checking
> functionality (which is one of the reasons we continue to look to btrfs as
> our long-term goal).

When thinking on a petabyte scale, bit rot is going to happen as a matter of fact. So I think Ceph should be prepared, at least when there are more than 2 replicas.

Regards

Christian
Re: [ceph-users] Behaviour of ceph pg repair on different replication levels
On 23.06.2014 20:24, Gregory Farnum wrote:
> Well, actually it always takes the primary copy, unless the primary has
> some way of locally telling that its version is corrupt. (This might
> happen if the primary thinks it should have an object, but it doesn't
> exist on disk.) But there's no voting or anything at this time.

Thanks Greg for the clarification. I wonder if some sort of voting during recovery would be feasible to implement. Having this available would make a 3x replica scheme immensely more useful. In my current understanding, Ceph has no guards against local bit rot (e.g., when a local disk returns incorrect data). Or is there already a voting scheme in place during deep scrub?

Regards

Christian
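For readers following along, the scrub/repair workflow being discussed is roughly this (the PG id is an example). Note that, per Greg's explanation, `ceph pg repair` resolves inconsistencies from the primary copy rather than by any voting:

```shell
# find PGs that scrubbing flagged as inconsistent
ceph health detail | grep inconsistent

# re-verify one PG: deep scrub reads and checksums the object data on all
# replicas, not just the metadata
ceph pg deep-scrub 86.37

# repair it -- this overwrites the replicas from the primary's copy
ceph pg repair 86.37
```

Because repair trusts the primary, it is worth investigating which replica is actually bad (e.g. via the OSD logs of the scrub errors) before running it.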
[ceph-users] trying to interpret lines in osd.log
I see several instances of the following log messages in the OSD logs each day:

2014-06-21 02:05:27.740697 7fbc58b78700 0 -- 172.22.8.12:6810/31918 172.22.8.12:6800/28827 pipe(0x7fbe400029f0 sd=764 :6810 s=0 pgs=0 cs=0 l=0 c=0x7fbe40003190).accept connect_seq 30 vs existing 29 state standby

2014-06-21 07:44:29.437810 7fbc452cb700 0 -- 172.22.8.12:6810/31918 172.22.8.16:6802/31292 pipe(0x7fbe40002d90 sd=748 :6810 s=2 pgs=11345 cs=57 l=0 c=0x7fbf68eb2a70).fault with nothing to send, going to standby

What does this mean? Anything to worry about?

TIA

Christian
Re: [ceph-users] OSDs
On 12.06.2014 14:09, Loic Dachary wrote:
> With the replication factor set to three (which is the default), it can
> tolerate that two OSDs fail at the same time.

I've noticed that a replication factor of 3 is the new default in Firefly. What rationale led to changing the default? It used to be 2. A replication factor of 3 incurs significantly more space overhead. Has a replication factor of 2 been proven to be insecure?

Regards

Christian
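Whatever the default, the replication factor can be chosen per pool, so the Firefly default need not dictate your space overhead. A sketch (the pool name is an example):

```shell
# set 3 replicas for one pool; min_size is how many replicas must be
# available for the pool to accept I/O
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

# defaults for newly created pools can be pinned in ceph.conf:
#   [global]
#   osd pool default size = 3
#   osd pool default min size = 2
```

With size=2 a single lost replica leaves only one copy, and any latent read error on that copy then means data loss, which is presumably the rationale behind the change.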
[ceph-users] error (24) Too many open files
Hi,

we have a Ceph cluster with 32 OSDs running on 4 servers (8 OSDs per server, one for each disk). From time to time, I see Ceph servers running out of file descriptors. It logs lines like:

2014-06-08 22:15:35.154759 7f850ac25700 0 filestore(/srv/ceph/osd/ceph-20) write couldn't open 86.37_head/a63e7df7/rbd_data.1933fe2ae8944a.042c/head//86: (24) Too many open files

2014-06-08 22:15:35.255955 7f850ac25700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int, ThreadPool::TPHandle*)' thread 7f850ac25700 time 2014-06-08 22:15:35.191181 os/FileStore.cc: 2448: FAILED assert(0 == unexpected error)

but apparently everything proceeds normally after that. Is the error considered critical? Should I adjust 'max open files' in ceph.conf? Or should I increase the value in /proc/sys/fs/file-max? Does anyone have a good recommendation?

TIA

Christian

Reference:

* We are running Ceph Emperor 0.72.2 on Linux 3.10.7.
* Full log follows:

2014-06-08 22:15:34.928660 7f84e6770700 0 cls cls/lock/cls_lock.cc:89: error reading xattr lock.rbd_lock: -24
2014-06-08 22:15:34.934733 7f84e6770700 0 cls cls/lock/cls_lock.cc:384: Could not read lock info: Unknown error -24
2014-06-08 22:15:35.085361 7f84ecf7d700 0 accepter.accepter no incoming connection? sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125393 7f84ecf7d700 0 accepter.accepter no incoming connection? sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125403 7f84ecf7d700 0 accepter.accepter no incoming connection? sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125407 7f84ecf7d700 0 accepter.accepter no incoming connection? sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125410 7f84ecf7d700 0 accepter.accepter no incoming connection? sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.154759 7f850ac25700 0 filestore(/srv/ceph/osd/ceph-20) write couldn't open 86.37_head/a63e7df7/rbd_data.1933fe2ae8944a.042c/head//86: (24) Too many open files
2014-06-08 22:15:35.159074 7f850ac25700 0 filestore(/srv/ceph/osd/ceph-20) error (24) Too many open files not handled on operation 10 (488954466.1.0, or op 0, counting from 0)
2014-06-08 22:15:35.159095 7f850ac25700 0 filestore(/srv/ceph/osd/ceph-20) unexpected error code
2014-06-08 22:15:35.159098 7f850ac25700 0 filestore(/srv/ceph/osd/ceph-20) transaction dump: { ops: [ { op_num: 0, op_name: write, collection: 86.37_head, oid: a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86, length: 4096, offset: 3104768, bufferlist length: 4096}, { op_num: 1, op_name: setattr, collection: 86.37_head, oid: a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86, name: _, length: 251}, { op_num: 2, op_name: setattr, collection: 86.37_head, oid: a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86, name: snapset, length: 31}]}
2014-06-08 22:15:35.255955 7f850ac25700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int, ThreadPool::TPHandle*)' thread 7f850ac25700 time 2014-06-08 22:15:35.191181 os/FileStore.cc: 2448: FAILED assert(0 == unexpected error)
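Since the question above juggles three separate limits (the OSD's own limit, the per-process ulimit, and the system-wide ceiling), a sketch of where each one is adjusted may help. The numbers are examples only:

```shell
# 1. The limit the OSD daemon sets for itself at startup, in ceph.conf:
#      [global]
#      max open files = 131072

# 2. The system-wide ceiling across all processes:
sysctl -w fs.file-max=1000000

# 3. Check how many fds one OSD process is actually using
#    (takes the first ceph-osd pid as an example):
ls /proc/$(pidof ceph-osd | awk '{print $1}')/fd | wc -l
```

A filestore OSD keeps one fd per recently touched object file plus its network sockets, so 8 busy OSDs per host can legitimately need tens of thousands of descriptors.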
[ceph-users] FAILED assert(_size = 0) during recovery - need to understand what's going on
, (boost::statechart::history_mode)0::shallow_construct(boost::intrusive_ptrPG::RecoveryState::Primary const, boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator)+0x4f) [0x83a3fa]
 5: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_statePG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0::transitPG::RecoveryState::Active()+0xa4) [0x83a5c8]
 6: (boost::statechart::simple_statePG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0::react_impl(boost::statechart::event_base const, void const*)+0x16a) [0x83a8ae]
 7: (boost::statechart::simple_statePG::RecoveryState::WaitFlushedPeering, PG::RecoveryState::Peering, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0::react_impl(boost::statechart::event_base const, void const*)+0x84) [0x837e0a]
 8: (boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator::process_queued_events()+0xf2) [0x81abe0]
 9: (boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator::process_event(boost::statechart::event_base const)+0x1e) [0x81ae24]
 10: (PG::handle_peering_event(std::tr1::shared_ptrPG::CephPeeringEvt, PG::RecoveryCtx*)+0x2fb) [0x7d6303]
 11: (OSD::process_peering_events(std::listPG*, std::allocatorPG* const, ThreadPool::TPHandle)+0x320) [0x64cdde]
 12: (OSD::PeeringWQ::_process(std::listPG*, std::allocatorPG* const, ThreadPool::TPHandle)+0x16) [0x6aa06a]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x569) [0x9cf66f]
 14: (ThreadPool::WorkThread::entry()+0x10) [0x9d10b6]
 15: (()+0x7b77) [0x7f0ee292bb77]
 16: (clone()+0x6d) [0x7f0ee0c6368d]

We finally managed to restart all 3 affected OSDs, but we got corrupted filesystems inside the VMs as well as scrub errors afterwards. How can this be? Isn't Ceph designed to handle network failures? Obviously, running nf_conntrack on Ceph hosts is not a brilliant idea, but it simply was present here. But I don't think that dropping network packets should lead to corrupted data. Am I right?

Any hints on what could be wrong here are appreciated! I don't want to run into a similar situation again.

TIA

Christian
Re: [ceph-users] smart replication
On 19.02.2014 12:01, Pavel V. Kaygorodov wrote:
> Is it possible to do this with ceph? If yes, how to configure this?

I think this can be achieved through multiple CRUSH rulesets. There is an example in the docs which explains how to differentiate between SSD and non-SSD storage:

http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds

The principles shown there can likely be adapted to your use case.

Regards

Christian
Re: [ceph-users] Removing OSD, double data migration
On 12.02.2014 20:27, Michael wrote:
> Have always wondered this: why does data get shuffled twice when you delete
> an OSD? You out an OSD and the data gets moved to other nodes -
> understandable - but then when you remove that OSD from crush it moves data
> again. Aren't outed OSDs and OSDs not in crush the same from a data
> placement point of view? What data is being moved when a fully outed OSD is
> then removed from crush?

I second this. When I adhere to the OSD removal how-to[1], I see heavy data migration taking place twice. This is a nuisance. The last time I had to take an OSD out of a cluster, I marked it out and removed it from the CRUSH map at the same time. I don't know if this is the recommended way, but it seemed to work.

Regards

Christian

[1] http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
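One commonly suggested sequence for avoiding the double migration is to drain the OSD via its CRUSH weight first: with the weight at 0, the cluster computes the same placement that the final removal will produce, so the later `crush remove` moves nothing. A sketch (OSD id and init script are examples; this is not the official how-to):

```shell
# drain: CRUSH now places data exactly as it will after removal
ceph osd crush reweight osd.12 0

# ... wait until 'ceph -s' reports all PGs active+clean ...

# the remaining steps then cause no further data movement
ceph osd out 12
/etc/init.d/ceph stop osd.12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
```

The double shuffle in the how-to comes from "out" and "crush remove" producing different CRUSH inputs (a weight-0 device in the map is not identical to no device); draining first sidesteps that.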
Re: [ceph-users] filesystem fragmentation on ext4 OSD
On 06.02.2014 16:24, Mark Nelson wrote:
> Hi Christian, can you tell me a little bit about how you are using Ceph and
> what kind of IO you are doing?

I just forgot to mention: we're running Ceph 0.72.2 on Linux 3.10 (both on the storage servers and inside the VMs) and Qemu-KVM 1.5.3.

Regards

Christian
Re: [ceph-users] filesystem fragmentation on ext4 OSD
On 07.02.2014 14:42, Mark Nelson wrote:
> Ok, so the reason I was wondering about the use case is if you were doing
> RBD specifically. Fragmentation has been something we've periodically kind
> of battled with but still see in some cases. BTRFS especially can get
> pretty spectacularly fragmented due to COW and overwrites. There's a thread
> from a couple of weeks ago called "rados io hints" that you may want to
> look at/contribute to.

Thank you for the hint. Sage's proposal on ceph-devel sounds good, so I'll wait for an implementation.

Regards

Christian
[ceph-users] filesystem fragmentation on ext4 OSD
Hi,

after running Ceph for a while I see a lot of fragmented files on our OSD filesystems (all running ext4). For example:

itchy ~ # fsck -f /srv/ceph/osd/ceph-5
fsck from util-linux 2.22.2
e2fsck 1.42 (29-Nov-2011)
[...]
/dev/mapper/vgosd00-ceph--osd00: 461903/418119680 files (33.7% non-contiguous), 478239460/836229120 blocks

This is an unusually high value for ext4. The normal expectation is something in the 5% range. I suspect that such high fragmentation produces lots of unnecessary seeks on the disks. Does anyone have an idea what to do to make Ceph fragment an OSD filesystem less?

TIA

Christian
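Besides the aggregate fsck number, per-file fragmentation can be inspected without unmounting anything; a possible approach (the OSD paths are examples from this thread, not a recommendation):

```shell
# extent counts for the object files of one PG; the worst offenders first
filefrag /srv/ceph/osd/ceph-5/current/86.37_head/* | sort -t: -k2 -rn | head

# e4defrag in check-only mode reports a fragmentation score for a whole
# ext4 mount without modifying it
e4defrag -c /srv/ceph/osd/ceph-5
```

This makes it possible to tell whether the fragmentation is concentrated in the 4 MiB RBD object files (suggesting allocate-on-syncfs behavior) or spread over the leveldb/omap files.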
Re: [ceph-users] filesystem fragmentation on ext4 OSD
On 06.02.2014 16:24, Mark Nelson wrote:
> Hi Christian, can you tell me a little bit about how you are using Ceph and
> what kind of IO you are doing?

Sure. We're using it almost exclusively for serving VM images that are accessed from Qemu's built-in RBD client. The VMs themselves perform a very wide range of I/O types, from servers that write mainly log files to ZEO database servers with nearly completely random I/O. Many VMs have slowly increasing storage utilization.

A reason could be that the OSDs issue syncfs() calls and ext4 then allocates extents covering just what has been written so far. But I'm not sure about the exact pattern of OSD/filesystem interaction.

HTH

Christian
Re: [ceph-users] Ceph Performance
On 09.01.2014 10:25, Bradley Kite wrote:
> 3 servers (quad-core CPU, 16GB RAM), each with 4 SATA 7.2K RPM disks (4TB)
> plus a 160GB SSD. [...] By comparison, a 12-disk RAID5 iSCSI SAN is doing
> ~4000 read iops and ~2000 iops write (but with 15K RPM SAS disks).

I think that comparing Ceph on 7.2k rpm SATA disks against iSCSI on 15k rpm SAS disks is not fair. The random access times of 15k SAS disks are hugely better than those of 7.2k SATA disks. It would be far more interesting to compare Ceph against iSCSI with identical disks.

Regards

Christian