[ceph-users] cephfs modification time
Hi,

I have a program that tails a file, and the file is created on another machine. Some tail implementations do not work because the modification time is not updated on the remote machine. I found this old thread, http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/11001, which mentions the problem and suggests an NTP sync. I tried to re-sync NTP and restart the Ceph cluster, but the issue persists. Do you know if it is possible to avoid this behavior?

Thanks,
-lorieri
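One possible workaround while mtime propagation is broken, sketched below, is to poll the file size instead of relying on mtime or inotify. This is only an illustration, not from the thread: the path is an example, it assumes GNU stat, and it assumes the size is refreshed even when the mtime is not (which matches the symptom described).

  f=/mnt/cephfs/app.log     # example path
  off=0
  while sleep 1; do
    size=$(stat -c %s "$f")           # poll size rather than mtime
    if [ "$size" -gt "$off" ]; then
      tail -c +$((off + 1)) "$f"      # emit only the newly appended bytes
      off=$size
    fi
  done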
Re: [ceph-users] Ceph as backend for Swift
It is not too difficult to get going, once you add various patches so it works:
- missing __init__.py
- allow setting ceph.conf
- fix a write issue: ioctx.write() does not return the written length
- add a param to the async_update call (for Swift in Juno)

There are a number of forks/pull requests for these. Mine is here (fix_install branch): https://github.com/markir9/swift-ceph-backend/commits/fix_install

However, do take a look at the thread about this on the openstack list: http://www.gossamer-threads.com/lists/openstack/dev/43482 In particular, the geo-replication side of things needs more coding before it works.

Cheers,
Mark

On 09/01/15 15:51, Sebastien Han wrote:
You can have a look at what I did here with Christian:
* https://github.com/stackforge/swift-ceph-backend
* https://github.com/enovance/swiftceph-ansible
If you have further questions, just let us know.

On 08 Jan 2015, at 15:51, Robert LeBlanc rob...@leblancnet.us wrote:
Anyone have a reference for documentation to get Ceph to be a backend for Swift?

Thanks,
Robert LeBlanc

Cheers.
Sébastien Han
Cloud Architect
"Always give 100%. Unless you're giving blood."
Phone: +33 (0)1 49 70 99 72
Mail: sebastien@enovance.com
Address: 11 bis, rue Roquépine - 75008 Paris
Web: www.enovance.com - Twitter: @enovance
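For reference, grabbing Mark's patched branch is an ordinary git checkout; the install step shown is an assumption about the repo's packaging, not something stated in the thread:

  git clone -b fix_install https://github.com/markir9/swift-ceph-backend.git
  cd swift-ceph-backend && python setup.py install   # assumes a standard setup.py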
Re: [ceph-users] mon problem after power failure
On 01/09/2015 04:31 PM, Jeff wrote:
We had a power failure last night and our five-node cluster has two nodes whose mons fail to start. Here's what we see:

# /usr/bin/ceph-mon --cluster=ceph -i ceph2 -f
2015-01-09 11:28:45.579267 b6c10740 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={6=support isa/lrc erasure code}
2015-01-09 11:28:45.606896 b6c10740 -1 error checking features: (1) Operation not permitted

and

# /usr/local/bin/ceph-mon --cluster=ceph -i ceph4 -f
Corruption: 6 missing files; e.g.: /var/lib/ceph/mon/ceph-ceph4/store.db/4011258.ldb
Corruption: 6 missing files; e.g.: /var/lib/ceph/mon/ceph-ceph4/store.db/4011258.ldb
2015-01-09 11:30:32.024445 b6ea1740 -1 failed to create new leveldb store

Does anyone have any suggestions for how to get these two monitors running again?

Recreate them. That's the only way I'm aware of, especially considering the leveldb corruption.

-Joao

--
Joao Eduardo Luis
Software Engineer | http://ceph.com
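A sketch of what "recreate them" typically involves, following the standard add/remove-monitor procedure; the ids, paths, and keyring handling below are examples, not commands from this thread:

  # on a healthy node: drop the broken mon from the monmap
  ceph mon remove ceph4
  # on ceph4: set the corrupt store aside and rebuild the mon data dir
  mv /var/lib/ceph/mon/ceph-ceph4 /var/lib/ceph/mon/ceph-ceph4.broken
  ceph mon getmap -o /tmp/monmap
  ceph auth get mon. -o /tmp/mon.keyring
  ceph-mon -i ceph4 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  # then start the mon again and watch it sync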
Re: [ceph-users] Erasure coded PGs incomplete
Hi Italo,

If you check for a post from me from a couple of days back, I have done exactly this. I created a k=5, m=3 pool over 4 hosts. This ensured that I could lose a whole host and then an OSD on another host and the cluster was still fully operational. I'm not sure the method I used in the CRUSH map was the best way to achieve this, but it seemed to work.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Italo Santos
Sent: 08 January 2015 22:35
To: Loic Dachary
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Erasure coded PGs incomplete

Thanks for your answer. But that raises another doubt... Suppose I have 4 hosts with an erasure pool created with k=3, m=1 and a failure domain of host, and I lose a host. In this case I'll face the same issue as at the beginning of this thread, because k+m > number of hosts, right?
- In this scenario, with one host down, am I still able to read and write data on the cluster?
- To solve the issue, will I need to add another host to the cluster?

Regards,
Italo Santos
http://italosantos.com.br/

On Wednesday, December 17, 2014 at 20:19, Loic Dachary wrote:

On 17/12/2014 19:46, Italo Santos wrote:
Understood. Thanks for your help, the cluster is healthy now :D
Also, using for example k=6, m=1 and a failure domain of host, I'll be able to lose all OSDs on the same host, but if I lose 2 disks on different hosts I can lose data, right? So, is there a failure domain which allows me to lose either an OSD or a host?

That's actually a good way to put it :-)

Regards,
Italo Santos
http://italosantos.com.br/

On Wednesday, December 17, 2014 at 4:27 PM, Loic Dachary wrote:

On 17/12/2014 19:22, Italo Santos wrote:
Loic, so, if I want a failure domain of host, I'll need to set up an erasure profile where k+m <= the total number of hosts I have, right?

Yes, k+m has to be <= the number of hosts.

Regards,
Italo Santos
http://italosantos.com.br/

On Wednesday, December 17, 2014 at 3:24 PM, Loic Dachary wrote:

On 17/12/2014 18:18, Italo Santos wrote:
Hello, I've taken a look at this documentation (which helped a lot) and, if I understand right, when I set a profile like:
=== ceph osd erasure-code-profile set isilon k=8 m=2 ruleset-failure-domain=host ===
and create a pool following the recommendations in the doc, I'll need (100*16)/2 = 800 PGs. Will I need a sufficient number of hosts to support creating all those PGs?

You will need k+m = 10 hosts, one OSD per host at minimum. If you only have 10 hosts that should be ok, and the 800 PGs will use these 10 OSDs in various orders. It also means that you will end up having 800 PGs per OSD, which is a bit too much. If you have 20 OSDs that will be better: each PG will get 10 OSDs out of 20 and each OSD will have 400 PGs. Ideally you want the number of PGs per OSD to be in the range of (approximately) [20,300].

Cheers

Regards,
Italo Santos
http://italosantos.com.br/

On Wednesday, December 17, 2014 at 2:42 PM, Loic Dachary wrote:

Hi, thanks for the update: good news are much appreciated :-) Would you have time to review the documentation at https://github.com/ceph/ceph/pull/3194/files ? It was partly motivated by the problem you had. Cheers

On 17/12/2014 14:03, Italo Santos wrote:
Hello Loic, thanks for your help. I've taken a look at my crush map, and I replaced "step chooseleaf indep 0 type osd" with "step choose indep 0 type osd" and all PGs were created successfully.

At.
Italo Santos
http://italosantos.com.br/

On Tuesday, December 16, 2014 at 8:39 PM, Loic Dachary wrote:

Hi, the 2147483647 means that CRUSH did not find enough OSDs for a given PG.
If you check the crush rule associated with the erasure coded pool, you will most probably find out why. Cheers

On 16/12/2014 23:32, Italo Santos wrote:
Hello, I'm trying to create an erasure pool following http://docs.ceph.com/docs/master/rados/operations/erasure-code/, but when I try to create a pool with a specific erasure-code-profile (myprofile) the PGs end up in an incomplete state. Can anyone help me?

Below is the profile I created:
root@ceph0001:~# ceph osd erasure-code-profile get myprofile
directory=/usr/lib/ceph/erasure-code
k=6
m=2
plugin=jerasure
technique=reed_sol_van

The status of the cluster:
root@ceph0001:~# ceph health
HEALTH_WARN 12 pgs incomplete; 12 pgs stuck inactive; 12 pgs stuck unclean

Health detail:
root@ceph0001:~# ceph health detail
HEALTH_WARN 12 pgs incomplete; 12 pgs stuck inactive; 12 pgs stuck unclean
pg 2.9 is stuck inactive since forever, current state incomplete, last acting [4,10,15,2147483647,3,2147483647,2147483647,2147483647]
pg 2.8 is stuck inactive since forever, current state incomplete, last acting [0,2147483647,4,2147483647,10,2147483647,15,2147483647]
pg 2.b is stuck inactive since forever, current state incomplete, last acting
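For anyone following along, the checks Loic suggests can be run roughly as below; the profile name is the one from this thread, the pool/object placeholders are illustrations:

  ceph osd erasure-code-profile get myprofile
  ceph osd crush rule dump              # inspect the rule the EC pool uses
  ceph pg dump_stuck inactive           # 2147483647 in "acting" = CRUSH found no OSD for that slot
  ceph osd map <pool> <object>          # check where CRUSH places a given object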
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete => Cluster unusable]
Hi Lionel,

we have a ceph cluster with in sum about 1 PB, 12 OSD nodes with 60 disks, divided into 4 racks in 2 rooms, all connected by a dedicated 10G cluster network, and of course with a replication level of 3. We did about nine months of intensive testing. Just like you, we had never experienced that kind of problem before, and an incomplete PG was recovering as soon as at least one OSD holding a copy of it came back up.

We still don't know what caused this specific error, but at no point were more than two hosts down at the same time. Our pool has a min_size of 1. And after everything was up again, we had completely LOST 2 of 3 pg copies (the directories on the OSDs were empty) and the third copy was obviously broken, because even manually injecting this pg into the other OSDs didn't change anything.

My main problem here is that even one incomplete PG renders your pool unusable. And there is currently no way to make ceph forget about the data of this pg and create it as an empty one. So the only way to make this pool usable again is to lose all your data in there. Which for me is just not acceptable.

Regards,
Christian

Am 07.01.2015 21:10, schrieb Lionel Bouton:
On 12/30/14 16:36, Nico Schottelius wrote:
Good evening, we also tried to rescue data *from* our old / broken pool by map'ing the rbd devices, mounting them on a host and rsync'ing away as much as possible. However, after some time rsync got completely stuck and eventually the host which mounted the rbd mapped devices decided to kernel panic, at which time we decided to drop the pool and go with a backup. This story and the one of Christian makes me wonder: is anyone using ceph as a backend for qemu VM images in production?

Yes, with Ceph 0.80.5 since September, after extensive testing over several months (including an earlier version IIRC) and some hardware failure simulations. We plan to upgrade one storage host and one monitor to 0.80.7 to validate this version over several months too before migrating the others.

And: has anyone on the list been able to recover from a pg incomplete / stuck situation like ours?

Only by adding back an OSD with the data needed to reach min_size for said pg, which is expected behavior. Even with some experimentation with isolated unstable OSDs, I've not yet witnessed a case where Ceph lost multiple replicas simultaneously (we lost one OSD to disk failure and another to a BTRFS bug, but without trying to recover the filesystem, so we might have been able to recover this OSD). If your setup is susceptible to situations where you can lose all replicas, you will lose data, and there's not much that can be done about that. Ceph actually begins to generate new replicas to replace the missing ones after "mon osd down out interval", so actual loss should not happen unless you lose (and can't recover) "size" OSDs on separate hosts (with the default crush map) simultaneously.

Before going into production you should know how long Ceph will take to fully recover from a disk or host failure by testing it under load. Your setup might not be robust if it lacks the available disk space or the speed needed to recover quickly from such a failure.
Lionel

--
Christian Eichelmann
Systemadministrator
1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
christian.eichelm...@1und1.de
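For readers who want to inspect the settings this exchange turns on, they can be read back as follows (the pool name and daemon id are examples, not from the thread):

  ceph osd pool get rbd size               # replica count ("size")
  ceph osd pool get rbd min_size           # writes block below this many replicas
  ceph daemon mon.a config get mon_osd_down_out_interval   # delay before re-replication starts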
Re: [ceph-users] Documentation of ceph pg num query
On Fri, Jan 9, 2015 at 1:24 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote:
Hi all, as mentioned last year, our ceph cluster is still broken and unusable. We are still investigating what happened, and I am taking deeper looks into the output of "ceph pg <pgnum> query". The problem is that I can find some information about what some of the sections mean, but mostly I can only guess. Is there any kind of documentation with explanations of what's stated there? Without that, the output is barely useful.

There is unfortunately not really documentation around this right now. If you have specific questions, someone can probably help you with them, though.
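For readers unfamiliar with the command under discussion, it is run per PG; the PG id below is an example:

  ceph pg 2.9 query     # detailed peering/recovery state for one PG
  ceph pg map 2.9       # just the up/acting OSD sets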
Re: [ceph-users] Uniform distribution
100GB objects (or ~40 on a hard drive!) are way too large for you to get an effective random distribution.
-Greg

On Thu, Jan 8, 2015 at 5:25 PM, Mark Nelson mark.nel...@inktank.com wrote:
On 01/08/2015 03:35 PM, Michael J Brewer wrote:
Hi all, I'm working on filling a cluster to near capacity for testing purposes, though I'm noticing that it isn't storing the data uniformly between OSDs during the filling process. I currently have the following levels:

Node 1:
/dev/sdb1 3904027124 2884673100 1019354024 74% /var/lib/ceph/osd/ceph-0
/dev/sdc1 3904027124 2306909388 1597117736 60% /var/lib/ceph/osd/ceph-1
/dev/sdd1 3904027124 3296767276 607259848 85% /var/lib/ceph/osd/ceph-2
/dev/sde1 3904027124 3670063612 233963512 95% /var/lib/ceph/osd/ceph-3

Node 2:
/dev/sdb1 3904027124 3250627172 653399952 84% /var/lib/ceph/osd/ceph-4
/dev/sdc1 3904027124 3611337492 292689632 93% /var/lib/ceph/osd/ceph-5
/dev/sdd1 3904027124 2831199600 1072827524 73% /var/lib/ceph/osd/ceph-6
/dev/sde1 3904027124 2466292856 1437734268 64% /var/lib/ceph/osd/ceph-7

I am using "rados put" to upload 100 GB files to the cluster, doing two at a time from two different locations. Is this expected behavior, or can someone shed light on why it is doing this? We're using the open-source version 0.80.7, with the default CRUSH configuration.

So crush utilizes pseudo-random distributions, but sadly random distributions tend to be clumpy and not perfectly uniform until you get to very high sample counts. The gist of it is that if you have a really low density of PGs per OSD and/or are very unlucky, you can end up with a skewed distribution. If you are even more unlucky, you could compound that with a streak of objects landing on PGs associated with some specific OSD. This particular case looks rather bad. How many PGs and OSDs do you have?

Regards,
MICHAEL J. BREWER
Phone: 1-512-286-5596 | Tie-Line: 363-5596
E-mail: mjbre...@us.ibm.com
11501 Burnet Rd, Austin, TX 78758-3400, United States
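Mark's question about PG and OSD counts can be answered with the commands below; the pool name is a placeholder. If the distribution stays skewed, reweighting by utilization is the usual remedy, sketched here with an example threshold:

  ceph osd tree                          # lists OSDs and hosts
  ceph osd pool get <pool> pg_num        # PGs in the pool being filled
  ceph osd reweight-by-utilization 120   # optional: nudge overfull OSDs down; 120 is an example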
Re: [ceph-users] ceph on peta scale
On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote:
I just finished configuring ceph up to 100 TB with openstack. Since we also use Lustre on our HPC machines, I am wondering what the bottleneck is for ceph at peta scale like Lustre. Any idea? Has someone tried it?

If you're talking about people building a petabyte Ceph system, there are *many* who run clusters of that size. If you're talking about the Ceph filesystem as a replacement for Lustre at that scale, the concern is less about the raw amount of data and more about the resiliency of the current code base at that size... but if you want to try it out and tell us what problems you run into, we will love you forever. ;)

(The scalable file system use case is what actually spawned the Ceph project, so in theory there shouldn't be any serious scaling bottlenecks. In practice it will depend on what kind of metadata throughput you need, because the multi-MDS stuff is improving but still less stable.)
-Greg
[ceph-users] Ceph configuration on multiple public networks.
Hi,

We've set up ceph and openstack on a fairly peculiar network configuration (or at least I think it is), and I'm looking for information on how to make it work properly. Basically, we have 3 networks: a management network, a storage network and a cluster network. The management network is over a 1 Gbps link, while the storage network is over 2 bonded 10 Gbps links. The cluster network can be ignored for now, as it works well.

Now, the main problem is that the ceph osd nodes are plugged into the management, storage and cluster networks, but the monitors are only plugged into the management network. When I run tests, I see that all the traffic ends up going through the management network, slowing down ceph's performance. Because of the current network setup, I can't hook the monitor nodes up to the storage network, as we're missing ports on the switch. Would it be possible to maintain access to the management nodes while forcing the ceph cluster to use the storage network for data transfer?

As a reference, here's my ceph.conf:

[global]
osd_pool_default_pgp_num = 800
osd_pg_bits = 12
auth_service_required = cephx
osd_pool_default_size = 3
filestore_xattr_use_omap = true
auth_client_required = cephx
osd_pool_default_pg_num = 800
auth_cluster_required = cephx
mon_host = 10.251.0.51
public_network = 10.251.0.0/24, 10.21.0.0/24
mon_initial_members = cephmon1
cluster_network = 192.168.31.0/24
fsid = 60e1b557-e081-4dab-aa76-e68ba38a159e
osd_pgp_bits = 12

As you can see, I've set up 2 public networks, 10.251.0.0 being the management network and 10.21.0.0 being the storage network. Would it be possible to maintain cluster functionality and remove 10.251.0.0/24 from the public_network list? For example, if I were to remove it from the public network list and referenced each monitor node's IP in the config file, would I be able to maintain connectivity?

--
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
jpmet...@gtcomm.net
http://www.gtcomm.net
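A minimal sketch of the change being proposed, assuming the mons keep their management-network addresses (which the monmap records) while OSDs serve clients on the storage network. This spells out the poster's hypothesis; it is not a confirmed recipe:

  [global]
  public_network = 10.21.0.0/24          # storage network only
  cluster_network = 192.168.31.0/24
  mon_host = 10.251.0.51                 # mon stays on the management network
  mon_initial_members = cephmon1

Whether this works depends on the OSD and client hosts being able to route to the mon address; OSDs bind their public endpoints inside public_network, so bulk client I/O would move to the 10.21.0.0/24 links.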
Re: [ceph-users] backfill_toofull, but OSDs not full
Hi,

I had a similar effect two weeks ago: one PG was backfill_toofull, and although reweighting and deleting had freed enough space, the rebuild process stopped after a while. After stopping and starting ceph on the second node, the rebuild ran without trouble and the backfill_toofull was gone. This happened with firefly.

Udo

On 09.01.2015 21:29, c3 wrote:
In this case the root cause was half-denied reservations: http://tracker.ceph.com/issues/9626 This stopped backfills, since those listed as backfilling were actually half denied and doing nothing. The toofull status is not checked until a free backfill slot opens, so everything was just stuck. Interestingly, the toofull was created by other backfills which were not stopped: http://tracker.ceph.com/issues/9594 Quite the log jam to clear.

Quoting Craig Lewis cle...@centraldesktop.com:
What was the osd_backfill_full_ratio? That's the config that controls backfill_toofull. By default, it's 85%. The mon_osd_*_ratio values affect ceph status. I've noticed that it takes a while for backfilling to restart after changing osd_backfill_full_ratio. Backfilling usually restarts for me in 10-15 minutes. Some PGs will stay in that state until the cluster is nearly done recovering. I've only seen backfill_toofull happen after the OSD exceeds the ratio (so it's reactive, not proactive). Mine usually happen when I'm rebalancing a nearfull cluster and an OSD backfills itself toofull.

On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:
Hi, I am wondering how a PG gets marked backfill_toofull. I reweighted several OSDs using "ceph osd crush reweight". As expected, PGs began moving around (backfilling). Some PGs got marked +backfilling (~10), some +wait_backfill (~100). But some are marked +backfill_toofull. My OSDs are between 25% and 72% full. Looking at "ceph pg dump", I can find the backfill_toofull PGs and have verified the OSDs involved are less than 72% full. Do backfill reservations include a size? Are these OSDs projected to be toofull once the current backfills complete? Some of the backfill_toofull and backfilling PGs point to the same OSDs. I did adjust the full ratios, but that did not change the backfill_toofull status:

ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'
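To confirm that injected values like the ones above actually took effect, the admin socket can be queried per daemon; the daemon ids are examples:

  ceph daemon osd.0 config get osd_backfill_full_ratio
  ceph daemon mon.a config get mon_osd_full_ratio

Note that injectargs only changes the running daemons; persisting the values across restarts requires ceph.conf as well.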
[ceph-users] RHEL 7 Installs
Ken,

I had a number of issues installing Ceph on RHEL 7, which I think are mostly due to dependencies. I followed the quick start guide, which gets the latest major release, e.g., Firefly or Giant.

ceph.conf is here: http://goo.gl/LNjFp3
Common errors in ceph.log included: http://goo.gl/yL8UsM

To resolve these, I had to download and install libunwind and python-jinja2. It also seems that the Giant repo had both 0.86 and 0.87 packages for python-ceph, and ceph-deploy didn't like that.

ceph.log error: http://goo.gl/oeKGUv

To resolve this, I had to download and install python-ceph v0.87, then run the "ceph-deploy install" command again.

--
John Wilkins
Red Hat
jowil...@redhat.com
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete => Cluster unusable]
Hi Nico,

I would probably recommend upgrading to 0.87 (giant). I have been running this version for some time now and it works very well. I also upgraded from firefly and it was easy. The issue you are experiencing seems quite complex and would require debug logs to troubleshoot. Apologies that I could not help more.

-Jiri

On 9/01/2015 20:23, Nico Schottelius wrote:
Good morning Jiri,

sure, let me catch up on this:
- Kernel 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x) / ubuntu (2x)

Cheers, Nico

Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]:
Hi Nico. If you are experiencing such issues, it would be good if you provided more info about your deployment: ceph version, kernel version, OS, filesystem (btrfs/xfs). Thx Jiri

- Reply message -
From: Nico Schottelius nico-eph-us...@schottelius.org
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete => Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the rbd devices, mounting them on a host and rsync'ing away as much as possible. However, after some time rsync got completely stuck and eventually the host which mounted the rbd mapped devices decided to kernel panic, at which time we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder: is anyone using ceph as a backend for qemu VM images in production? And: has anyone on the list been able to recover from a pg incomplete / stuck situation like ours? Reading about the issues on the list here gives me the impression that ceph as a software is stuck/incomplete and has not yet become ready clean for production (sorry for the word joke).

Cheers, Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
Hi Nico and all others who answered,

after some more trying to somehow get the pgs into a working state (I've tried force_create_pg, which put them in creating state, but that was obviously not true, since after rebooting one of the containing OSDs it went back to incomplete), I decided to save what could be saved. I created a new pool, created a new image there, and mapped the old image from the old pool and the new image from the new pool to a machine, to copy the data at the POSIX level. Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool, which is totally not understandable to me.

Right now, it seems like Ceph is giving me no option to either save some of the still-intact rbd volumes, or to create a new pool alongside the old one to at least enable our clients to send data to ceph again. To tell the truth, I guess that will result in the end of our ceph project (running for nine months already).

Regards, Christian

Am 29.12.2014 15:59, schrieb Nico Schottelius:
Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: [incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completely. So I am sorry not to be able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*)

Cheers, Nico

(*) We migrated from sheepdog to gluster to ceph, and so far sheepdog seems to run much smoother.
The first one is however not supported by opennebula directly, and the second is not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts), so we are using ceph at the moment.

--
Christian Eichelmann
Systemadministrator
1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
christian.eichelm...@1und1.de

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
Re: [ceph-users] Documentation of ceph pg num query
Have you looked at:
http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
http://ceph.com/docs/master/rados/operations/pg-states/
http://ceph.com/docs/master/rados/operations/pg-concepts/

On Fri, Jan 9, 2015 at 1:24 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote:
Hi all, as mentioned last year, our ceph cluster is still broken and unusable. We are still investigating what happened, and I am taking deeper looks into the output of "ceph pg <pgnum> query". The problem is that I can find some information about what some of the sections mean, but mostly I can only guess. Is there any kind of documentation with explanations of what's stated there? Without that, the output is barely useful. Regards, Christian

--
John Wilkins
Red Hat
jowil...@redhat.com
(415) 425-9599
http://redhat.com
Re: [ceph-users] Uniform distribution
I didn't actually calculate the per-OSD object density, but yes, I agree that will hurt.

On 01/09/2015 12:09 PM, Gregory Farnum wrote:
100GB objects (or ~40 on a hard drive!) are way too large for you to get an effective random distribution.
-Greg

On Thu, Jan 8, 2015 at 5:25 PM, Mark Nelson mark.nel...@inktank.com wrote: [...]
Re: [ceph-users] Slow/Hung IOs
It doesn't seem like the problem here, but I've noticed that slow OSDs have a large fan-out. I have fewer than 100 OSDs, so every OSD talks to every other OSD in my cluster, and I was getting slow notices from all of my OSDs. Nothing jumped out, so I started looking at disk write latency graphs. I noticed that all the OSDs in one node had 10x the write latency of the other nodes. After that, I graphed the number of slow notices per OSD, and saw a much higher number of slow requests on that node. Long story short, I had lost a battery on my write cache. But it wasn't at all obvious from the slow request notices, not until I dug deeper.

On Mon, Jan 5, 2015 at 4:07 PM, Sanders, Bill bill.sand...@teradata.com wrote:
Thanks for the reply. 14 and 18 happened to show up during that run, but it's certainly not only those OSDs. It seems to vary each run. Just from the runs I've done today, I've seen the following pairs of OSDs:
['0,13', '0,18', '0,24', '0,25', '0,32', '0,34', '0,36', '10,22', '11,30', '12,28', '13,30', '14,22', '14,24', '14,27', '14,30', '14,31', '14,33', '14,34', '14,35', '14,39', '16,20', '16,27', '18,38', '19,30', '19,31', '19,39', '20,38', '22,30', '26,37', '26,38', '27,33', '27,34', '27,36', '28,32', '28,34', '28,36', '28,37', '3,18', '3,27', '3,29', '3,37', '4,10', '4,29', '5,19', '5,37', '6,25', '9,28', '9,29', '9,37']
That is almost all of the OSDs in the system.

Bill

From: Lincoln Bryant [linco...@uchicago.edu]
Sent: Monday, January 05, 2015 3:40 PM
To: Sanders, Bill
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Slow/Hung IOs

Hi Bill,

From your log excerpt, it looks like your slow requests are happening on OSDs 14 and 18. Is it always these two OSDs? If you don't have a long recovery time (e.g., the cluster is just full of test data), maybe you could try setting OSDs 14 and 18 out and re-benching? Alternatively, I suppose you could just use bonnie++ or dd etc. to write to those OSDs (careful not to clobber any Ceph dirs) and see how the performance looks.

Cheers,
Lincoln

On Jan 5, 2015, at 4:36 PM, Sanders, Bill wrote:
Hi Ceph Users,

We've got a Ceph cluster we've built, and we're experiencing issues with slow or hung IOs, even running 'rados bench' on the OSD cluster. Things start out great, ~600 MB/s, then rapidly drop off as the test waits for IOs. Nothing seems to be taxed... the system just seems to be waiting. Any help trying to figure out what could cause the slow IOs is appreciated.
For example, 'rados -p rbd bench 60 write -t 32' takes over 900 s to complete. A typical rados bench:

Total time run:         957.458274
Total writes made:      9251
Write size:             4194304
Bandwidth (MB/sec):     38.648
Stddev Bandwidth:       157.323
Max bandwidth (MB/sec): 964
Min bandwidth (MB/sec): 0
Average Latency:        3.21126
Stddev Latency:         51.9546
Max latency:            910.72
Min latency:            0.04516

According to ceph.log, we're not experiencing any OSD flapping or monitor election cycles, just slow requests:

# grep slow /var/log/ceph/ceph.log:
2015-01-05 13:42:42.937678 osd.18 39.7.48.7:6803/11185 220 : [WRN] 3 slow requests, 1 included below; oldest blocked for 513.611379 secs
2015-01-05 13:42:42.937685 osd.18 39.7.48.7:6803/11185 221 : [WRN] slow request 30.136429 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37
2015-01-05 13:42:49.938681 osd.18 39.7.48.7:6803/11185 222 : [WRN] 3 slow requests, 1 included below; oldest blocked for 520.612372 secs
2015-01-05 13:42:49.938688 osd.18 39.7.48.7:6803/11185 223 : [WRN] slow request 480.636547 seconds old, received at 2015-01-05 13:34:49.302080: osd_op(client.92008.1:3100010 rb.0.140d.238e1f29.0c77 [write 3622400~512] 3.d031a69f ondisk+write e994) v4 currently waiting for subops from 26,37
2015-01-05 13:43:12.941838 osd.18 39.7.48.7:6803/11185 224 : [WRN] 3 slow requests, 1 included below; oldest blocked for 543.615545 secs
2015-01-05 13:43:12.941844 osd.18 39.7.48.7:6803/11185 225 : [WRN] slow request 60.140595 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37
2015-01-05 13:44:04.933440 osd.14 39.7.48.7:6818/11640 251 : [WRN] 4 slow requests, 1 included below; oldest blocked for 606.941954 secs
2015-01-05 13:44:04.933469 osd.14 39.7.48.7:6818/11640 252 : [WRN] slow request 240.101138 seconds old, received at 2015-01-05 13:40:04.832272: osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.0010 [write 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops from 27,33
2015-01-05 13:44:12.950805 osd.18
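Craig's approach of tallying slow-request notices per OSD can be scripted against the cluster log; a rough sketch, using the log path from the thread and a regex that assumes the format shown above:

  grep 'slow request' /var/log/ceph/ceph.log | grep -oE 'osd\.[0-9]+' | sort | uniq -c | sort -rn

The "waiting for subops from X,Y" pairs can be extracted the same way to spot an OSD that keeps appearing as a slow replica.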
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete => Cluster unusable]
On Fri, Jan 9, 2015 at 2:00 AM, Nico Schottelius nico-ceph-us...@schottelius.org wrote:
Lionel, Christian, we have exactly the same trouble as Christian, namely:

Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
We still don't know what caused this specific error...

and

...there is currently no way to make ceph forget about the data of this pg and create it as an empty one. So the only way to make this pool usable again is to lose all your data in there.

I wonder what the position of the ceph developers is regarding dropping (emptying) specific pgs? Is that a use case that was never thought of or tested?

I've never worked directly on any of the clusters this has happened to, but I believe every time we've seen issues like this with somebody we have a relationship with, it's either: 1) been resolved by using the existing tools to mark stuff lost, or 2) been the result of local filesystems/disks silently losing data due to some fault or other. The second case means the OSDs have corrupted state, and trusting them is tricky. Also, most people we've had relationships with that this has happened to really want to not lose all the data in the PG, which necessitates manually mucking around anyway. ;)

Mailing list issues are obviously a lot harder to categorize, but the ones we've taken time on where people say the commands don't work have generally fallen into the second bucket. If you want to experiment, I think all the manual mucking around has been done with the objectstore tool: removing bad PGs, moving them around, or faking journal entries. But I've not done it myself, so I could be mistaken.
-Greg
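A hedged sketch of the kind of PG surgery Greg alludes to, using the objectstore tool of that era (the OSD must be stopped first; the ids and paths are examples, and the binary name varies between releases):

  ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --pgid 2.9 --op export --file /tmp/pg2.9.export      # keep a copy first
  ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --pgid 2.9 --op remove                               # then remove the bad PG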
Re: [ceph-users] backfill_toofull, but OSDs not full
What was the osd_backfill_full_ratio? That's the config that controls backfill_toofull. By default, it's 85%. The mon_osd_*_ratio values affect ceph status.

I've noticed that it takes a while for backfilling to restart after changing osd_backfill_full_ratio. Backfilling usually restarts for me in 10-15 minutes. Some PGs will stay in that state until the cluster is nearly done recovering.

I've only seen backfill_toofull happen after the OSD exceeds the ratio (so it's reactive, not proactive). Mine usually happen when I'm rebalancing a nearfull cluster and an OSD backfills itself toofull.

On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:
Hi, I am wondering how a PG gets marked backfill_toofull. I reweighted several OSDs using "ceph osd crush reweight". As expected, PGs began moving around (backfilling). Some PGs got marked +backfilling (~10), some +wait_backfill (~100). But some are marked +backfill_toofull. My OSDs are between 25% and 72% full. Looking at "ceph pg dump", I can find the backfill_toofull PGs and have verified the OSDs involved are less than 72% full.

Do backfill reservations include a size? Are these OSDs projected to be toofull once the current backfills complete? Some of the backfill_toofull and backfilling PGs point to the same OSDs.

I did adjust the full ratios, but that did not change the backfill_toofull status:
ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'
Re: [ceph-users] backfill_toofull, but OSDs not full
In this case the root cause was half-denied reservations: http://tracker.ceph.com/issues/9626 This stopped backfills, since those listed as backfilling were actually half denied and doing nothing. The toofull status is not checked until a free backfill slot opens, so everything was just stuck. Interestingly, the toofull was created by other backfills which were not stopped: http://tracker.ceph.com/issues/9594 Quite the log jam to clear.

Quoting Craig Lewis cle...@centraldesktop.com: [...]
Re: [ceph-users] RHEL 7 Installs
Hi John,

For the last part, there being two different versions of packages in Giant, I don't think that's the actual problem. What's really happening is that python-ceph has been obsoleted by other packages that are getting picked up by Yum. See the line that says "Package python-ceph is obsoleted by python-rados...". It's the same deal as http://tracker.ceph.com/issues/10476; you could try the same fix there.

On Fri, Jan 9, 2015 at 4:50 PM, John Wilkins john.wilk...@inktank.com wrote: [...]
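If the fix referenced in that tracker issue is the usual one for cross-repo obsoletes, it amounts to enabling obsolete checking in the Yum priorities plugin; this is an assumption based on the issue, not a command from this thread:

  yum install -y yum-plugin-priorities
  echo 'check_obsoletes=1' >> /etc/yum/pluginconf.d/priorities.conf
  yum clean all && yum makecache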
Re: [ceph-users] PG num calculator live on Ceph.com
Very, very good :)

Fri, 9 Jan 2015, 2:17, William Bloom (wibloom) wibl...@cisco.com:
Awesome, thanks Michael.

Regards,
William

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michael J. Kidd
Sent: Wednesday, January 07, 2015 2:09 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] PG num calculator live on Ceph.com

Hello all,

Just a quick heads up that we now have a PG calculator to help determine the proper per-pool PG numbers to achieve a target PG-per-OSD ratio: http://ceph.com/pgcalc

Please check it out! Happy to answer any questions, and we always welcome any feedback on the tool / verbiage, etc.

As an aside, we're also working to update the documentation to reflect best practices. See the Ceph.com tracker for this at: http://tracker.ceph.com/issues/9867

Thanks!
Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services - by Red Hat
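For reference, the heuristic such calculators are commonly built on is the one from the Ceph placement-group docs; the exact rounding pgcalc applies is the tool's own:

  # Target PGs per pool ~ (number of OSDs x target PGs per OSD) / replica size,
  # rounded up to the next power of two.
  # Example: 100 OSDs, target 100 PGs/OSD, size 3: 100 * 100 / 3 = 3334, round up to 4096.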
[ceph-users] question about S3 multipart upload ignores request headers
I patched http://tracker.ceph.com/issues/8452, ran the s3 test suite, and there is still an error.

Error log:
ERROR: failed to get obj attrs, obj=test-client.0-31zepqoawd8dxfa-212:_multipart_mymultipart.2/0IQGoJ7hG8ZtTyfAnglChBO79HUsjeC.meta ret=-2

I found code that may have the problem. When the function executes

ret = get_obj_attrs(store, s, meta_obj, attrs, NULL, NULL);

it should perhaps execute

meta_obj.set_in_extra_data(true);

before it, because meta_obj is in the extra bucket.

baijia...@126.com
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete => Cluster unusable]
Lionel, Christian,

we have exactly the same trouble as Christian, namely:

Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
We still don't know what caused this specific error...

and

...there is currently no way to make ceph forget about the data of this pg and create it as an empty one. So the only way to make this pool usable again is to lose all your data in there.

I wonder what the position of the ceph developers is regarding dropping (emptying) specific pgs? Is that a use case that was never thought of or tested?

For us it is essential to be able to keep the pool/cluster running even in case we have lost pgs. Even though I do not like the fact that we lost a pg for an unknown reason, I would prefer ceph to handle that case and recover to the best possible situation.

Namely, I wonder if we could integrate a tool that shows which (parts of) rbd images would be affected by dropping a pg. That would give us the chance to selectively restore VMs in case this happens again.

Cheers,
Nico

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
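Something close to the tool Nico asks for can be approximated with existing commands; a rough sketch (the pool, image, and PG ids are examples, and this walks every object, so it is slow on big pools):

  prefix=$(rbd -p rbd info myimage | awk '/block_name_prefix/ {print $2}')
  rados -p rbd ls | grep "^$prefix" | while read obj; do
      ceph osd map rbd "$obj"        # prints the PG (and acting set) holding each image object
  done | grep '(2.9)'                # filter for the PG in question (example id)

Any image with objects mapping to the lost PG would be (partially) affected by dropping it.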
Re: [ceph-users] Ceph PG Incomplete => Cluster unusable
On Thu, 8 Jan 2015 21:17:12 -0700, Robert LeBlanc wrote:
On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer ch...@gol.com wrote:
On Thu, 8 Jan 2015 11:41:37 -0700, Robert LeBlanc wrote:

Which of course currently means a strongly consistent lockup in these scenarios. ^o^

That is one way of putting it.

If I had the time, and more importantly the talent, to help with code, I'd do so. Failing that, pointing out the often painful truth is something I can do.

Slightly off-topic and snarky: that strong consistency is of course of limited use when, in the case of a corrupted PG, Ceph basically asks you to toss a coin. As in minor corruption, impossible for a mere human to tell which replica is the good one, because one OSD is down and the 2 remaining ones differ by one bit or so.

This is where checksumming is supposed to come in. I think Sage has been leading that initiative.

Yeah, I'm aware of that effort. Of course, in the meantime even a very simple majority vote would be most welcome and helpful in nearly all cases (with 3 replicas available). One wonders if this is basically acknowledging that while offloading some things like checksums to the underlying layer/FS is desirable from a codebase/effort/complexity view, neither BTRFS nor ZFS is fully production ready, and they won't be for some time.

Basically, when an OSD reads an object, it should be able to tell if there was bit rot by hashing what it just read and checking the MD5SUM it computed when it first received the object. If it doesn't match, it can ask another OSD until it finds one that matches. This provides a number of benefits:
1. Protection against bit rot, checked on read and on deep scrub.
2. Automatic recovery of the correct version of the object.
3. If the client computes the MD5SUM before it is sent over the wire, the data can be guaranteed through the memory of several machines/devices/cables/etc.
4. Getting by with size 2 is less risky for those who really want to do that.
With all these benefits, there is a trade-off associated with it, mostly CPU. However, with the inclusion of AES in silicon, it may not be a huge issue now. But I'm not a programmer nor familiar enough with this aspect of the Ceph code to be authoritative in any way.

Yup, all very useful and pertinent points.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
[ceph-users] Documentation of ceph pg num query
Hi all,

as mentioned last year, our ceph cluster is still broken and unusable. We are still investigating what happened, and I am taking deeper looks into the output of "ceph pg <pgnum> query". The problem is that I can find some information about what some of the sections mean, but mostly I can only guess. Is there any kind of documentation with explanations of what's stated there? Without that, the output is barely useful.

Regards,
Christian
Re: [ceph-users] Ceph Minimum Cluster Install (ARM)
On Thu, 8 Jan 2015 01:35:03 +0000, Garg, Pankaj wrote:
Hi, I am trying to get a very minimal Ceph cluster up and running (on ARM), and I'm wondering what the smallest unit is that I can run rados bench on. The documentation at http://ceph.com/docs/next/start/quick-ceph-deploy/ seems to refer to 4 different nodes: an admin node, a monitor node and 2 OSD-only nodes. Can the admin node be an x86 machine even if the deployment is ARM based? Or can the admin node and monitor node co-exist? Finally, I'm assuming I can get by with only 1 independent OSD node. If that's possible, I can get by with 2 ARM systems only. Can someone please shed some light on whether this will work?

You can even do everything on one node (you need to set the replica size to 1). However, that will not realistically reflect reality at all, and any benchmarks will be skewed (as always with very small clusters). In addition, your systems need to be fast enough: CPU for the OSDs and MON, fast storage for the MON (OS), and sufficient RAM to handle everything.

Again, learning some things about Ceph is feasible with a minimal cluster, but drawing conclusions about performance from what you have at hand will be tricky. Lastly, Ceph can get very compute-intensive (OSDs, particularly with small I/Os), so I'm skeptical that ARM will cut the mustard.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
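Christian's "replica size 1" single-node setup translates to roughly the following; the pool name and the crush tweak are the commonly documented ones for single-host test clusters, included here as a hedged sketch:

  # in ceph.conf before deploying, so pools place replicas across OSDs, not hosts:
  #   osd crush chooseleaf type = 0
  ceph osd pool set rbd size 1
  ceph osd pool set rbd min_size 1
  rados -p rbd bench 60 write -t 32     # the benchmark under discussion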
Re: [ceph-users] Ceph PG Incomplete => Cluster unusable
On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer ch...@gol.com wrote:
On Thu, 8 Jan 2015 11:41:37 -0700, Robert LeBlanc wrote:

Which of course currently means a strongly consistent lockup in these scenarios. ^o^

That is one way of putting it.

Slightly off-topic and snarky: that strong consistency is of course of limited use when, in the case of a corrupted PG, Ceph basically asks you to toss a coin. As in minor corruption, impossible for a mere human to tell which replica is the good one, because one OSD is down and the 2 remaining ones differ by one bit or so.

This is where checksumming is supposed to come in. I think Sage has been leading that initiative. Basically, when an OSD reads an object, it should be able to tell if there was bit rot by hashing what it just read and checking the MD5SUM it computed when it first received the object. If it doesn't match, it can ask another OSD until it finds one that matches. This provides a number of benefits:
1. Protection against bit rot, checked on read and on deep scrub.
2. Automatic recovery of the correct version of the object.
3. If the client computes the MD5SUM before it is sent over the wire, the data can be guaranteed through the memory of several machines/devices/cables/etc.
4. Getting by with size 2 is less risky for those who really want to do that.
With all these benefits, there is a trade-off associated with it, mostly CPU. However, with the inclusion of AES in silicon, it may not be a huge issue now. But I'm not a programmer nor familiar enough with this aspect of the Ceph code to be authoritative in any way.
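As a toy illustration of the client-side variant in point 3 (this is not Ceph's internal mechanism, just a manual end-to-end check; the pool and object names are examples):

  md5sum ./payload                      # checksum before upload
  rados -p rbd put myobject ./payload
  rados -p rbd get myobject - | md5sum  # compare after a round trip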
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete => Cluster unusable]
Good morning Jiri,

sure, let me catch up on this:
- Kernel 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x) / ubuntu (2x)

Cheers, Nico

Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]:
Hi Nico. If you are experiencing such issues, it would be good if you provided more info about your deployment: ceph version, kernel version, OS, filesystem (btrfs/xfs). Thx Jiri

- Reply message -
From: Nico Schottelius nico-eph-us...@schottelius.org
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete => Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the rbd devices, mounting them on a host and rsync'ing away as much as possible. However, after some time rsync got completely stuck and eventually the host which mounted the rbd mapped devices decided to kernel panic, at which time we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder: is anyone using ceph as a backend for qemu VM images in production? And: has anyone on the list been able to recover from a pg incomplete / stuck situation like ours? Reading about the issues on the list here gives me the impression that ceph as a software is stuck/incomplete and has not yet become ready clean for production (sorry for the word joke).

Cheers, Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
Hi Nico and all others who answered,

after some more trying to somehow get the pgs into a working state (I've tried force_create_pg, which put them in creating state, but that was obviously not true, since after rebooting one of the containing OSDs it went back to incomplete), I decided to save what could be saved.

I created a new pool, created a new image there, and mapped the old image from the old pool and the new image from the new pool to a machine, to copy the data at the POSIX level. Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool, which is totally not understandable to me.

Right now, it seems like Ceph is giving me no option to either save some of the still-intact rbd volumes, or to create a new pool alongside the old one to at least enable our clients to send data to ceph again. To tell the truth, I guess that will result in the end of our ceph project (running for nine months already).

Regards, Christian

Am 29.12.2014 15:59, schrieb Nico Schottelius:
Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: [incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completely. So I am sorry not to be able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*)

Cheers, Nico

(*) We migrated from sheepdog to gluster to ceph, and so far sheepdog seems to run much smoother. The first one is however not supported by opennebula directly, and the second is not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts), so we are using ceph at the moment.
--
Christian Eichelmann
Systemadministrator
1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
christian.eichelm...@1und1.de

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24