0.55 crashed during upgrade to bobtail
Hi,

All osds in the dev cluster died shortly after the upgrade (package-only, i.e. a binary upgrade, without even restarting the running processes); please see the attached file.

Was: 0.55.1-356-g850d1d5
Upgraded to: 0.56 tag

The only difference is the gcc version and the corresponding libstdc++ - 4.6 on the buildhost and 4.7 on the cluster. Of course I can do a rollback and the problem will very likely go away, but it seems there should be some fix.

I also saw something similar in the testing env a few days ago - a package upgrade within 0.55 killed all _windows_ guests and one out of tens of linux guests running on top of rbd. Unfortunately I have no debug sessions at the moment and only have the tail of the log from qemu:

terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what(): buffer::end_of_buffer

I'm blaming the ldconfig action from librbd, because nothing else would cause this kind of damage to running processes - but maybe I'm wrong.

thanks!

WBR, Andrey

[Attachment: crashes-2013-01-01.tgz - GNU Zip compressed data]
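[Editorial note] A quick way to sanity-check the ldconfig/library-replacement theory (a diagnostic sketch only; <qemu-pid> is a placeholder for the guest's process id) is to look at the running process's memory map:

# grep librbd /proc/<qemu-pid>/maps

If the package upgrade replaced librbd on disk while qemu kept the old copy mapped, the mapping will show a "(deleted)" suffix. That by itself does not normally hurt a running process, so the buffer::end_of_buffer exception may instead point at a decoding mismatch between the client library and the upgraded daemons - worth keeping in mind while debugging.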
Re: automatic repair of inconsistent pg?
OK thanks! Will change that.

On 31.12.2012 20:21, Samuel Just wrote:
> The ceph-osd relies on fs barriers for correctness. You will want to
> remove the nobarrier option to prevent future corruption.
> -Sam

On Mon, Dec 31, 2012 at 3:59 AM, Stefan Priebe <s.pri...@profihost.ag> wrote:
> On 31.12.2012 02:10, Samuel Just wrote:
>> Are you using xfs? If so, what mount options?
>
> Yes, noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k
>
> Stefan

On Dec 30, 2012 1:28 PM, Stefan Priebe <s.pri...@profihost.ag> wrote:
> On 30.12.2012 19:17, Samuel Just wrote:
>> This is somewhat more likely to have been a bug in the replication
>> logic (there were a few fixed between 0.53 and 0.55). Had there been
>> any recent osd failures?
>
> Yes, I was stressing CEPH with failures (power, link, disk, ...).
>
> Stefan

On Dec 24, 2012 10:55 PM, Sage Weil <s...@inktank.com> wrote:
> On Tue, 25 Dec 2012, Stefan Priebe wrote:
>> Hello list,
>>
>> today i got the following ceph status output:
>>
>> 2012-12-25 02:57:00.632945 mon.0 [INF] pgmap v1394388: 7632 pgs:
>> 7631 active+clean, 1 active+clean+inconsistent; 151 GB data,
>> 307 GB used, 5028 GB / 5336 GB avail
>>
>> i then grepped the inconsistent pg by:
>> # ceph pg dump - | grep inconsistent
>> 3.ccf 10 0 0 0 41037824 155930 155930 active+clean+inconsistent
>> 2012-12-25 01:51:35.318459 6243'2107 6190'9847 [14,42] [14,42]
>> 6243'2107 2012-12-25 01:51:35.318436 6007'2074 2012-12-23 01:51:24.386366
>>
>> and initiated a repair:
>> # ceph pg repair 3.ccf
>> instructing pg 3.ccf on osd.14 to repair
>>
>> The log output then was:
>> 2012-12-25 02:56:59.056382 osd.14 [ERR] 3.ccf osd.42 missing 1c602ccf/rbd_data.4904d6b8b4567.0b84/head//3
>> 2012-12-25 02:56:59.056385 osd.14 [ERR] 3.ccf osd.42 missing ceb55ccf/rbd_data.48cc66b8b4567.1538/head//3
>> 2012-12-25 02:56:59.097989 osd.14 [ERR] 3.ccf osd.42 missing dba6bccf/rbd_data.4797d6b8b4567.15ad/head//3
>> 2012-12-25 02:56:59.097991 osd.14 [ERR] 3.ccf osd.42 missing a4deccf/rbd_data.45f956b8b4567.03d5/head//3
>> 2012-12-25 02:56:59.098022 osd.14 [ERR] 3.ccf repair 4 missing, 0 inconsistent objects
>> 2012-12-25 02:56:59.098046 osd.14 [ERR] 3.ccf repair 4 errors, 4 fixed
>>
>> Why doesn't ceph repair this automatically? How could this happen at all?
>
> We just made some fixes to repair in next (it was broken sometime between
> ~0.53 and 0.55). The latest next should repair it.
>
> In general we don't repair automatically lest we inadvertently propagate
> bad data or paper over a bug.
>
> As for the original source of the missing objects... I'm not sure. There
> were some fixed races related to backfill that could lead to an object
> being missed, but Sam would know more about how likely that actually is.
>
> sage
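[Editorial note] Sam's advice amounts to dropping nobarrier from the osd mount options. A minimal sketch of the corresponding /etc/fstab entry (device and mount point are placeholders; the remaining options are taken from Stefan's report):

/dev/sdX1  /var/lib/ceph/osd/ceph-0  xfs  noatime,nodiratime,logbufs=8,logbsize=256k  0  0

The change takes effect once the filesystem is remounted and the osd restarted. The manual repair steps from the thread can likewise be scripted - keeping Sage's caveat in mind that repair is deliberately not automatic - for example by looping over whatever 'ceph pg dump' reports as inconsistent:

# ceph pg dump | grep inconsistent | awk '{print $1}' | while read pg; do ceph pg repair $pg; done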
Re: 0.55 crashed during upgrade to bobtail
On Tue, Jan 1, 2013 at 9:49 PM, Andrey Korolyov <and...@xdel.ru> wrote:
> [original report snipped - see the first message in this thread]

Sorry, I'm not able to reproduce the crash after the rollback, and the traces were incomplete due to lack of disk space at the specified core location, so please don't mind it.
Re: 0.55 crashed during upgrade to bobtail
On Wed, Jan 2, 2013 at 12:16 AM, Andrey Korolyov <and...@xdel.ru> wrote:
> [previous messages snipped]

Ahem, finally it seems that the osd process is stumbling over something on the fs: my other environments were also able to reproduce the crash once, but it cannot be reproduced once a new osd process has been started over the existing filestore (an offline version rollback and another try at the online upgrade went fine). And the backtraces in the first message are complete, at least 1 and 2, despite the lack of space the first time - I have received a couple of coredumps whose traces look exactly the same as 2.
Re: v0.56 released
Doh! Sorry about that. It looks like they are still in rpm-testing: http://www.ceph.com/rpm-testing/el6/x86_64/

I imagine Gary will have them in the non-testing repo some time tomorrow.

Mark

On 01/01/2013 08:24 PM, Dennis Jacobfeuerborn wrote:
> Hi,
> apparently the RPM link points to version 0.52. Where can the RPMs for
> 0.56 be found?
>
> Regards,
> Dennis

On 01/01/2013 07:02 AM, Sage Weil wrote:
> We're bringing in the new year with a new release, v0.56, which will form
> the basis of the next stable series, bobtail. There is little in the way
> of new functionality since v0.55, as we've been focusing primarily on
> stability, performance, and upgradability from the previous argonaut
> stable series (v0.48.x).
>
> If you are a current argonaut user, you can either upgrade now, or watch
> the Inktank blog for the bobtail announcement after some additional
> testing has been completed. If you are a v0.55 or v0.55.1 user, we
> recommend upgrading now.
>
> Notable changes since v0.55 include:
>
> * librbd: fixes for read-only pools for image cloning
> * osd: fix for mixing argonaut and post-v0.54 OSDs
> * osd: some recovery tuning
> * osd: fix for several scrub, recovery, and watch/notify races/bugs
> * osd: fix pool_stat_t backward compatibility with pre-v0.41 clients
> * osd: experimental split support
> * mkcephfs: misc fixes for fs initialization, mounting
> * radosgw: usage and op logs off by default
> * radosgw: keystone authentication off by default
> * upstart: only enabled when an 'upstart' file exists in the daemon data directory
> * mount.fuse.ceph: allow mounting of ceph-fuse via /etc/fstab
> * config: always complain about config parsing errors
> * mon: fixed memory leaks, misc bugs
> * mds: many misc fixes
>
> Notable changes since v0.48.2 (argonaut):
>
> * auth: authentication is now on by default; see release notes!
> * osd: improved threading, small io performance
> * osd: deep scrubbing (verify object data)
> * osd: chunky scrubs (more efficient)
> * osd: improved performance during recovery
> * librbd: cloning support
> * librbd: fine-grained striping support
> * librbd: better caching
> * radosgw: improved Swift and S3 API coverage (POST, multi-object delete, striping)
> * radosgw: OpenStack Keystone integration
> * radosgw: efficient usage stats aggregation (for billing)
> * crush: improvements in distribution (still off by default; see CRUSH tunables)
> * ceph-fuse, mds: general stability improvements
> * release RPMs for OpenSUSE, SLES, Fedora, RHEL, CentOS
> * tons of bug fixes and small improvements across the board
>
> If you are upgrading from v0.55, there are no special upgrade
> instructions. If you are upgrading from an older version, please read the
> release notes. Authentication is now enabled by default, and if you do
> not adjust your ceph.conf accordingly before upgrading, the system will
> not come up by itself.
> You can get this release from the usual locations:
>
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://ceph.com/download/ceph-0.56.tar.gz
> * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
> * For RPMs, see http://ceph.com/docs/master/install/rpm
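[Editorial note] Regarding the point above that authentication is now on by default: if an existing cluster has no cephx keys set up, ceph.conf needs to be adjusted before the upgrade so the cluster comes back up. A minimal sketch of the relevant [global] options (set them to cephx instead of none once keys have been created, if authentication is actually wanted):

[global]
        auth cluster required = none
        auth service required = none
        auth client required = none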