Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10

2015-08-12 Thread Gerd Jakobovitsch

An update:

It seems I am running into memory shortage. Even with 32 GB of RAM for 20 
OSDs and 2 GB of swap, ceph-osd uses all available memory.
I created another 10 GB swap device, and with that I managed to get the 
failed OSD running without crashing, but it consumed an extra 5 GB.
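
(For reference, an extra swap file of that size can be set up roughly like 
this - the file path here is just an example:)

  dd if=/dev/zero of=/var/swap-extra bs=1M count=10240   # allocate 10 GB
  chmod 600 /var/swap-extra
  mkswap /var/swap-extra
  swapon /var/swap-extra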

Are there known memory issues with ceph-osd?
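
(If it helps: with the tcmalloc build, OSD heap usage can be inspected and 
excess memory handed back to the OS via the tell interface - the osd id 
here is just an example:)

  ceph tell osd.7 heap stats     # print tcmalloc heap usage
  ceph tell osd.7 heap release   # return freed pages to the OS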

But I still have the problem of the incomplete+inactive PG.

Regards.

Gerd

On 12-08-2015 10:11, Gerd Jakobovitsch wrote:

I tried it; the error propagates to whichever OSD gets the errored PG.

For the moment, this is my worst problem. I have one PG 
incomplete+inactive, and the OSD with the highest priority for it 
accumulates 100 blocked requests (I guess that is the maximum) and, 
although running, does not handle other requests - for example, ceph tell 
osd.21 injectargs '--osd-max-backfills 1'. After some time it crashes, and 
the blocked requests move to the second OSD for the errored PG. I can't 
get rid of these slow requests.
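
(The admin socket usually still answers locally even when the OSD no longer 
handles cluster requests, so the blocked ops can at least be listed - 
osd.21 taken as the example here:)

  ceph daemon osd.21 dump_ops_in_flight    # currently pending/blocked ops
  ceph daemon osd.21 dump_historic_ops     # recently completed slow ops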


I suspected a problem with leveldb; I checked and found the default 
version for Debian wheezy (0+20120530.gitdd0d562-1). I upgraded it to the 
wheezy-backports version (1.17-1~bpo70+1), but the error was the same.
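
(For anyone wanting to reproduce the check, roughly - assuming the library 
package is named libleveldb1:)

  dpkg -l | grep leveldb                            # show installed leveldb version
  apt-get -t wheezy-backports install libleveldb1   # pull 1.17-1~bpo70+1 from backports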


I use the regular wheezy kernel (3.2+46).

On 11-08-2015 23:52, Haomai Wang wrote:

It seems like a leveldb problem. Could you just kick it out and add a
new OSD to make the cluster healthy first?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:

Dear all,

I run a Ceph system with 4 nodes and ~80 OSDs using XFS, currently at 75%
usage, running Firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and
since then several OSDs have been crashing and never recovering: trying to
start one ends up crashing as follows.

Is this problem known? Is there any configuration that should be checked?
Any way to try to recover these OSDs without losing all data?

After that, having marked the OSD as lost, I got one incomplete, inactive PG.
Is there any way to recover it? The data still exists on the crashed OSDs.
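
(For reference, roughly what I ran - the osd and pg ids are placeholders here:)

  ceph osd lost 7 --yes-i-really-mean-it   # mark the crashed OSD as lost
  ceph health detail                       # lists the incomplete/inactive PG
  ceph pg <pgid> query                     # peering state, e.g. down_osds_we_would_probe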

Regards.

[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found
mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success,
global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location
{host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7
/var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal
/var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) mount fsid is
54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780  0
filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780  1
filestore(/var/lib/ceph/osd/ceph-7)  disabling 'filestore replica fadvise'
due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
  in thread 7f200fa8f780

  ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
  1: /usr/bin/ceph-osd() [0xab7562]
  2: (()+0xf0a0) [0x7f200efcd0a0]
  3: (gsignal()+0x35) [0x7f200db3f165]
  4: (abort()+0x180) [0x7f200db423e0]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
  6: (()+0x63996) [0x7f200e393996]
  7: (()+0x639c3) [0x7f200e3939c3]
  8: (()+0x63bee) [0x7f200e393bee]
  9: (tc_new()+0x48e) [0x7f200f213aee]
  10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
std::allocator<char> const&)+0x59) [0x7f200e3ef999]
  11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned
long)+0x28) [0x7f200e3f0708]
  12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
  13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
  14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2)
[0x7f200f46ffa2]
  15: 

Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10

2015-08-12 Thread Gerd Jakobovitsch

I tried it; the error propagates to whichever OSD gets the errored PG.

For the moment, this is my worst problem. I have one PG 
incomplete+inactive, and the OSD with the highest priority for it 
accumulates 100 blocked requests (I guess that is the maximum) and, 
although running, does not handle other requests - for example, ceph tell 
osd.21 injectargs '--osd-max-backfills 1'. After some time it crashes, and 
the blocked requests move to the second OSD for the errored PG. I can't 
get rid of these slow requests.


I suspected a problem with leveldb; I checked and found the default 
version for Debian wheezy (0+20120530.gitdd0d562-1). I upgraded it to the 
wheezy-backports version (1.17-1~bpo70+1), but the error was the same.


I use the regular wheezy kernel (3.2+46).

On 11-08-2015 23:52, Haomai Wang wrote:

It seems like a leveldb problem. Could you just kick it out and add a
new OSD to make the cluster healthy first?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:


Dear all,

I run a Ceph system with 4 nodes and ~80 OSDs using XFS, currently at 75%
usage, running Firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and
since then several OSDs have been crashing and never recovering: trying to
start one ends up crashing as follows.

Is this problem known? Is there any configuration that should be checked?
Any way to try to recover these OSDs without losing all data?

After that, having marked the OSD as lost, I got one incomplete, inactive PG.
Is there any way to recover it? The data still exists on the crashed OSDs.

Regards.

[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found
mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success,
global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location
{host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7
/var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal
/var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) mount fsid is
54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780  0
filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780  1
filestore(/var/lib/ceph/osd/ceph-7)  disabling 'filestore replica fadvise'
due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
  in thread 7f200fa8f780

  ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
  1: /usr/bin/ceph-osd() [0xab7562]
  2: (()+0xf0a0) [0x7f200efcd0a0]
  3: (gsignal()+0x35) [0x7f200db3f165]
  4: (abort()+0x180) [0x7f200db423e0]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
  6: (()+0x63996) [0x7f200e393996]
  7: (()+0x639c3) [0x7f200e3939c3]
  8: (()+0x63bee) [0x7f200e393bee]
  9: (tc_new()+0x48e) [0x7f200f213aee]
  10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
std::allocator<char> const&)+0x59) [0x7f200e3ef999]
  11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned
long)+0x28) [0x7f200e3f0708]
  12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
  13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
  14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2)
[0x7f200f46ffa2]
  15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*,
unsigned long*)+0x180) [0x7f200f468360]
  16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2)
[0x7f200f46adf2]
  17: (leveldb::DB::Open(leveldb::Options const&, std::string const&,
leveldb::DB**)+0xff) [0x7f200f46b11f]
  18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
  19: (FileStore::mount()+0x18e0) [0x9b7080]
  20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
  

Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10

2015-08-11 Thread Haomai Wang
It seems like a leveldb problem. Could you just kick it out and add a
new OSD to make the cluster healthy first?
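
(For completeness, "kick it out" would be roughly the usual removal
sequence, with osd.7 as an example, followed by preparing a fresh OSD on a
new or wiped disk:)

  ceph osd out 7
  # stop the ceph-osd daemon on the node, then remove it from the cluster:
  ceph osd crush remove osd.7
  ceph auth del osd.7
  ceph osd rm 7
  # afterwards add a new OSD, e.g. with ceph-deploy osd create or ceph-disk prepare/activate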

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:


 Dear all,

 I run a Ceph system with 4 nodes and ~80 OSDs using XFS, currently at 75%
 usage, running Firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and
 since then several OSDs have been crashing and never recovering: trying to
 start one ends up crashing as follows.

 Is this problem known? Is there any configuration that should be checked?
 Any way to try to recover these OSDs without losing all data?

 After that, having marked the OSD as lost, I got one incomplete, inactive PG.
 Is there any way to recover it? The data still exists on the crashed OSDs.

 Regards.

 [(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
 === osd.7 ===
 2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found
 mon.spcsmp2
 2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success,
 global_id 206010466
 create-or-move updated item name 'osd.7' weight 3.64 at location
 {host=spcsnp3,root=default} to crush map
 Starting Ceph osd.7 on spcsnp3...
 2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10
 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
 starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7
 /var/lib/ceph/osd/ceph-7/journal
 [(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10
 filestore(/var/lib/ceph/osd/ceph-7) dump_stop
 2015-08-11 12:58:21.348291 7f200fa8f780  5
 filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal
 /var/lib/ceph/osd/ceph-7/journal
 2015-08-11 12:58:21.348326 7f200fa8f780 10
 filestore(/var/lib/ceph/osd/ceph-7) mount fsid is
 54c136da-c51c-4799-b2dc-b7988982ee00
 2015-08-11 12:58:21.349010 7f200fa8f780  0
 filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
 2015-08-11 12:58:21.349026 7f200fa8f780  1
 filestore(/var/lib/ceph/osd/ceph-7)  disabling 'filestore replica fadvise'
 due to known issues with fadvise(DONTNEED) on xfs
 2015-08-11 12:58:21.353277 7f200fa8f780  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
 ioctl is supported and appears to work
 2015-08-11 12:58:21.353302 7f200fa8f780  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
 ioctl is disabled via 'filestore fiemap' config option
 2015-08-11 12:58:21.362106 7f200fa8f780  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
 syscall(SYS_syncfs, fd) fully supported
 2015-08-11 12:58:21.362195 7f200fa8f780  0
 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
 disabled by conf
 2015-08-11 12:58:21.362701 7f200fa8f780  5
 filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
 2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
  in thread 7f200fa8f780

  ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
  1: /usr/bin/ceph-osd() [0xab7562]
  2: (()+0xf0a0) [0x7f200efcd0a0]
  3: (gsignal()+0x35) [0x7f200db3f165]
  4: (abort()+0x180) [0x7f200db423e0]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
  6: (()+0x63996) [0x7f200e393996]
  7: (()+0x639c3) [0x7f200e3939c3]
  8: (()+0x63bee) [0x7f200e393bee]
  9: (tc_new()+0x48e) [0x7f200f213aee]
  10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
 std::allocator<char> const&)+0x59) [0x7f200e3ef999]
  11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned
 long)+0x28) [0x7f200e3f0708]
  12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
  13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
  14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2)
 [0x7f200f46ffa2]
  15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*,
 unsigned long*)+0x180) [0x7f200f468360]
  16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2)
 [0x7f200f46adf2]
  17: (leveldb::DB::Open(leveldb::Options const&, std::string const&,
 leveldb::DB**)+0xff) [0x7f200f46b11f]
  18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
  19: (FileStore::mount()+0x18e0) [0x9b7080]
  20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
  21: (main()+0x2234) [0x7331c4]
  22: (__libc_start_main()+0xfd) [0x7f200db2bead]
  23: /usr/bin/ceph-osd() [0x736e99]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
 interpret this.

 --- begin dump of recent events ---
-66 2015-08-11 12:58:21.277524 7f200fa8f780  5 asok(0x2800230)
 register_command perfcounters_dump hook 0x27f0010
-65 2015-08-11 12:58:21.277552 7f200fa8f780  5 asok(0x2800230)
 register_command 1 hook 0x27f0010
-64 2015-08-11 12:58:21.277556 7f200fa8f780  5 asok(0x2800230)
 register_command perf dump hook 0x27f0010
-63 2015-08-11 12:58:21.277561 7f200fa8f780  5 asok(0x2800230)
 register_command perfcounters_schema hook 0x27f0010
-62 2015-08-11 12:58:21.277564