0.55 crashed during upgrade to bobtail

2013-01-01 Thread Andrey Korolyov
Hi,

All OSDs in the dev cluster died shortly after the upgrade (package-only,
i.e. a binary upgrade, without even restarting the running processes);
please see the attached file.

Was: 0.55.1-356-g850d1d5
Upgraded to: 0.56 tag

The only difference is the gcc/libstdc++ version: 4.6 on the build host
and 4.7 on the cluster. Of course I can roll back, and the problem will
most likely go away, but it seems there should be a proper fix. I also saw
something similar in the testing environment a few days ago: a package
upgrade within 0.55 killed all _Windows_ guests and one out of tens of
Linux guests running on top of rbd. Unfortunately I have no debug session
at the moment, only the tail of the qemu log:

terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer

I blame the ldconfig action from the librbd package, because nothing else
would take down the running processes like that - but maybe I'm wrong.
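
For reference, one way to double-check which libstdc++ the installed
binaries actually resolve to (paths below are just examples for this box):

    ldd /usr/bin/ceph-osd | grep libstdc++
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | tail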

thanks!

WBR, Andrey


crashes-2013-01-01.tgz
Description: GNU Zip compressed data


Re: automatic repair of inconsistent pg?

2013-01-01 Thread Stefan Priebe

OK thanks! Will change that.
On 31.12.2012 20:21, Samuel Just wrote:

The ceph-osd relies on fs barriers for correctness.  You will want to
remove the nobarrier option to prevent future corruption.
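
For example, the fstab entry for the osd data disk with nobarrier dropped
would look like this (device and mount point are just placeholders):

    /dev/sdX1  /var/lib/ceph/osd/ceph-N  xfs  noatime,nodiratime,logbufs=8,logbsize=256k  0  0
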
-Sam

On Mon, Dec 31, 2012 at 3:59 AM, Stefan Priebe s.pri...@profihost.ag wrote:

On 31.12.2012 02:10, Samuel Just wrote:


Are you using xfs?  If so, what mount options?



Yes,
noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k

Stefan



On Dec 30, 2012 1:28 PM, Stefan Priebe s.pri...@profihost.ag wrote:

   On 30.12.2012 19:17, Samuel Just wrote:

   This is somewhat more likely to have been a bug in the replication logic
   (there were a few fixed between 0.53 and 0.55).  Had there been any
   recent osd failures?

   Yes, I was stressing Ceph with failures (power, link, disk, ...).

   Stefan

   On Dec 24, 2012 10:55 PM, Sage Weil s...@inktank.com wrote:

   On Tue, 25 Dec 2012, Stefan Priebe wrote:
 Hello list,

 today I got the following ceph status output:
 2012-12-25 02:57:00.632945 mon.0 [INF] pgmap v1394388: 7632 pgs: 7631 active+clean, 1 active+clean+inconsistent; 151 GB data, 307 GB used, 5028 GB / 5336 GB avail


 I then grepped for the inconsistent pg with:
 # ceph pg dump - | grep inconsistent
 3.ccf   10   0   0   0   41037824   155930   155930   active+clean+inconsistent   2012-12-25 01:51:35.318459   6243'2107   6190'9847   [14,42]   [14,42]   6243'2107   2012-12-25 01:51:35.318436   6007'2074   2012-12-23 01:51:24.386366

 and initiated a repair:
 #  ceph pg repair 3.ccf
 instructing pg 3.ccf on osd.14 to repair

 The log output then was:
 2012-12-25 02:56:59.056382 osd.14 [ERR] 3.ccf osd.42 missing 1c602ccf/rbd_data.4904d6b8b4567.0b84/head//3
 2012-12-25 02:56:59.056385 osd.14 [ERR] 3.ccf osd.42 missing ceb55ccf/rbd_data.48cc66b8b4567.1538/head//3
 2012-12-25 02:56:59.097989 osd.14 [ERR] 3.ccf osd.42 missing dba6bccf/rbd_data.4797d6b8b4567.15ad/head//3
 2012-12-25 02:56:59.097991 osd.14 [ERR] 3.ccf osd.42 missing a4deccf/rbd_data.45f956b8b4567.03d5/head//3
 2012-12-25 02:56:59.098022 osd.14 [ERR] 3.ccf repair 4 missing, 0 inconsistent objects
 2012-12-25 02:56:59.098046 osd.14 [ERR] 3.ccf repair 4 errors, 4 fixed

 Why doesn't ceph repair this automatically? How could this happen at all?
  
   We just made some fixes to repair in next (it was broken sometime between
   ~0.53 and 0.55).  The latest next should repair it.  In general we don't
   repair automatically lest we inadvertently propagate bad data or paper
   over a bug.
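
   Concretely, once on the latest next, re-checking would go something like
   this (pg id taken from the output above):

       ceph pg repair 3.ccf
       ceph -s                            # wait for the repair scrub to finish
       ceph pg dump | grep inconsistent   # should come back empty afterwards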
  
   As for the original source of the missing objects... I'm not sure.  There
   were some fixed races related to backfill that could lead to an object
   being missed, but Sam would know more about how likely that actually is.
  
   sage


Re: 0.55 crashed during upgrade to bobtail

2013-01-01 Thread Andrey Korolyov
Sorry, I'm not able to reproduce the crash after the rollback, and the
traces were incomplete due to a lack of disk space at the specified core
location, so please disregard it.


Re: 0.55 crashed during upgrade to bobtail

2013-01-01 Thread Andrey Korolyov
Ahem, it finally seems that the osd process is stumbling on something on
the fs: my other environments were also able to reproduce the crash once,
but it cannot be reproduced after a new osd process has started over the
existing filestore (an offline version rollback and another try at an
online upgrade both went fine). And the backtraces in the first message
are complete, at least 1 and 2, despite the lack of space the first time -
I have received a couple of coredumps whose traces look exactly the same
as 2.
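
In case it is useful, the backtraces can be pulled out of such cores with
gdb in the usual way (binary and core paths below are only examples):

    gdb /usr/bin/ceph-osd /var/crash/core.ceph-osd
    (gdb) thread apply all bt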


Re: v0.56 released

2013-01-01 Thread Mark Nelson

Doh!  Sorry about that.  It looks like they are still in rpm-testing:

http://www.ceph.com/rpm-testing/el6/x86_64/

I imagine Gary will have them in the non-testing repo some time tomorrow.

Mark

On 01/01/2013 08:24 PM, Dennis Jacobfeuerborn wrote:

Hi,
apparently the RPM link points to version 0.52. Where can the RPMs for 0.56
be found?

Regards,
   Dennis

On 01/01/2013 07:02 AM, Sage Weil wrote:

We're bringing in the new year with a new release, v0.56, which will form
the basis of the next stable series bobtail. There is little in the way
of new functionality since v0.55, as we've been focusing primarily on
stability, performance, and upgradability from the previous argonaut
stable series (v0.48.x). If you are a current argonaut user, you can
either upgrade now, or watch the Inktank blog for the bobtail announcement
after some additional testing has been completed. If you are a v0.55 or
v0.55.1 user, we recommend upgrading now.

Notable changes since v0.55 include:

  * librbd: fixes for read-only pools for image cloning
  * osd: fix for mixing argonaut and post-v0.54 OSDs
  * osd: some recovery tuning
  * osd: fix for several scrub, recovery, and watch/notify races/bugs
  * osd: fix pool_stat_t backward compatibility with pre-v0.41 clients
  * osd: experimental split support
  * mkcephfs: misc fixes for fs initialization, mounting
  * radosgw: usage and op logs off by default
  * radosgw: keystone authentication off by default
  * upstart: only enabled when 'upstart' file exists in daemon data directory
  * mount.fuse.ceph: allow mounting of ceph-fuse via /etc/fstab
  * config: always complain about config parsing errors
  * mon: fixed memory leaks, misc bugs
  * mds: many misc fixes

Notable changes since v0.48.2 (argonaut):

  * auth: authentication is now on by default; see release notes!
  * osd: improved threading, small io performance
  * osd: deep scrubbing (verify object data)
  * osd: chunky scrubs (more efficient)
  * osd: improved performance during recovery
  * librbd: cloning support
  * librbd: fine-grained striping support
  * librbd: better caching
  * radosgw: improved Swift and S3 API coverage (POST, multi-object delete,
striping)
  * radosgw: OpenStack Keystone integration
  * radosgw: efficient usage stats aggregation (for billing)
  * crush: improvements in distribution (still off by default; see CRUSH
tunables)
  * ceph-fuse, mds: general stability improvements
  * release RPMs for OpenSUSE, SLES, Fedora, RHEL, CentOS
  * tons of bug fixes and small improvements across the board

If you are upgrading from v0.55, there are no special upgrade
instructions. If you are upgrading from an older version, please read the
release notes. Authentication is now enabled by default, and if you do not
adjust your ceph.conf accordingly before upgrading, the system will not
come up by itself.
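
For example, a cluster that previously ran without cephx can stay that way
by adding something like the following to the [global] section of ceph.conf
before the upgrade:

    auth cluster required = none
    auth service required = none
    auth client required = none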

You can get this release from the usual locations:

  * Git at git://github.com/ceph/ceph.git
  * Tarball at http://ceph.com/download/ceph-0.56.tar.gz
  * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
  * For RPMs, see http://ceph.com/docs/master/install/rpm
