Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-29 Thread Andrey Korolyov
On Wed, Oct 29, 2014 at 1:37 PM, Haomai Wang  wrote:
> Maybe you can run it directly with debug_osd=20/20 and capture the final log lines:
> ceph-osd -i 1 -c /etc/ceph/ceph.conf -f
>
> On Wed, Oct 29, 2014 at 6:34 PM, Andrey Korolyov  wrote:
>> On Wed, Oct 29, 2014 at 1:28 PM, Haomai Wang  wrote:
>>> Thanks!
>>>
>>> You mean osd.1 exited abruptly without a ceph callback trace?
>>> Does anyone have ideas about this log? @sage @gregory
>>>
>>>
>>> On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov  wrote:
 On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang  wrote:
> Thanks, Andrey.
>
> Is the attached OSD.1 log really only these lines? I can't find any
> detailed info in it.
>
> Maybe you need to increase debug_osd to 20/20?
>
> On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov  wrote:
>> Hi Haomai, all.
>>
>> Today, after an unexpected power failure, one of the kv stores (placed
>> on ext4 with default mount options) refused to work. I think it may be
>> interesting to revive it, because this is almost the first time among
>> hundreds of power failures (and their simulations) that a data store
>> got broken.
>>
>> Strace:
>> http://xdel.ru/downloads/osd1.strace.gz
>>
>> Debug output with 20-everything level:
>> http://xdel.ru/downloads/osd1.out
>
>
>
> --
> Best Regards,
>
> Wheat


 Unfortunately that's all I've got. Updated osd1.out to show the actual
 CLI args and the entire output - it ends abruptly without a final
 newline and without any valuable output.
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>> With a log file specified, it adds just the following line at the very end:
>>
>> 2014-10-29 13:29:57.437776 7ffa562c9840 -1  ** ERROR: osd init failed:
>> (22) Invalid argument
>>
>> The stdout printing seems a bit broken and does not print this at all
>> (and the store output part is definitely not detailed enough to draw
>> any conclusions from, or even to file a bug). CCing Sage/Greg.
>
>
>
> --
> Best Regards,
>
> Wheat

-f does not print the last line to stderr either. OK, it looks like a
very minor, separate bug; I remember it appearing long before, but since
the bug remains, it probably does not bother anyone - stderr output is
less commonly used for debugging purposes.
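
For the record, a couple of untested variants that should push that
final error onto the console (option names from memory, so double-check
them):

# -d runs the daemon in the foreground *and* logs to stderr, unlike -f:
ceph-osd -d -i 1 -c /etc/ceph/ceph.conf --debug-osd 20/20 2>&1 | tee /tmp/osd.1.console.log

# or keep -f and ask for stderr logging explicitly:
ceph-osd -f -i 1 -c /etc/ceph/ceph.conf --log-to-stderr=true --err-to-stderr=true --debug-osd 20/20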


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-29 Thread Haomai Wang
Maybe you can run it directly with debug_osd=20/20 and capture the final log lines:
ceph-osd -i 1 -c /etc/ceph/ceph.conf -f
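
If overriding on the command line is awkward, the same debug level can
also go into ceph.conf, roughly like this (section name and option
spelling from memory, please verify for your build):

[osd.1]
    debug osd = 20/20
    debug keyvaluestore = 20
    log file = /var/log/ceph/ceph-osd.1.log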

On Wed, Oct 29, 2014 at 6:34 PM, Andrey Korolyov  wrote:
> On Wed, Oct 29, 2014 at 1:28 PM, Haomai Wang  wrote:
>> Thanks!
>>
>> You mean osd.1 exited abruptly without a ceph callback trace?
>> Does anyone have ideas about this log? @sage @gregory
>>
>>
>> On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov  wrote:
>>> On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang  wrote:
 Thanks, Andrey.

 Is the attached OSD.1 log really only these lines? I can't find any
 detailed info in it.

 Maybe you need to increase debug_osd to 20/20?

 On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov  wrote:
> Hi Haomai, all.
>
> Today, after an unexpected power failure, one of the kv stores (placed
> on ext4 with default mount options) refused to work. I think it may be
> interesting to revive it, because this is almost the first time among
> hundreds of power failures (and their simulations) that a data store
> got broken.
>
> Strace:
> http://xdel.ru/downloads/osd1.strace.gz
>
> Debug output with 20-everything level:
> http://xdel.ru/downloads/osd1.out



 --
 Best Regards,

 Wheat
>>>
>>>
>>> Unfortunately that's all I've got. Updated osd1.out to show the actual
>>> CLI args and the entire output - it ends abruptly without a final
>>> newline and without any valuable output.
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
> With a log file specified, it adds just the following line at the very end:
>
> 2014-10-29 13:29:57.437776 7ffa562c9840 -1  ** ERROR: osd init failed:
> (22) Invalid argument
>
> The stdout printing seems a bit broken and does not print this at all
> (and the store output part is definitely not detailed enough to draw
> any conclusions from, or even to file a bug). CCing Sage/Greg.



-- 
Best Regards,

Wheat


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-29 Thread Andrey Korolyov
On Wed, Oct 29, 2014 at 1:28 PM, Haomai Wang  wrote:
> Thanks!
>
> You mean osd.1 exited abruptly without a ceph callback trace?
> Does anyone have ideas about this log? @sage @gregory
>
>
> On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov  wrote:
>> On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang  wrote:
>>> Thanks, Andrey.
>>>
>>> Is the attached OSD.1 log really only these lines? I can't find any
>>> detailed info in it.
>>>
>>> Maybe you need to increase debug_osd to 20/20?
>>>
>>> On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov  wrote:
 Hi Haomai, all.

 Today, after an unexpected power failure, one of the kv stores (placed
 on ext4 with default mount options) refused to work. I think it may be
 interesting to revive it, because this is almost the first time among
 hundreds of power failures (and their simulations) that a data store
 got broken.

 Strace:
 http://xdel.ru/downloads/osd1.strace.gz

 Debug output with 20-everything level:
 http://xdel.ru/downloads/osd1.out
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>> Unfortunately that's all I've got. Updated osd1.out to show the actual
>> CLI args and the entire output - it ends abruptly without a final
>> newline and without any valuable output.
>
>
>
> --
> Best Regards,
>
> Wheat


With a log file specified, it adds just the following line at the very end:

2014-10-29 13:29:57.437776 7ffa562c9840 -1  ** ERROR: osd init failed:
(22) Invalid argument

The stdout printing seems a bit broken and does not print this at all
(and the store output part is definitely not detailed enough to draw
any conclusions from, or even to file a bug). CCing Sage/Greg.
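
For completeness, an equivalent run with an explicit log file looks
roughly like this (paths are just examples, not the exact ones used):

ceph-osd -i 1 -c /etc/ceph/ceph.conf -f --log-file /tmp/osd.1.debug.log --debug-osd 20/20
# then look at the context just before the failure:
grep -B 40 'osd init failed' /tmp/osd.1.debug.log | tail -n 60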


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-29 Thread Haomai Wang
Thanks!

You mean osd.1 exited abruptly without a ceph callback trace?
Does anyone have ideas about this log? @sage @gregory
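
Normally a genuine crash leaves a signal or assert block with a
backtrace near the end of the OSD log, so it is worth grepping for one
(search strings and log path from memory, adjust as needed):

grep -B 2 -A 40 -E 'Caught signal|FAILED assert' /var/log/ceph/ceph-osd.1.log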


On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov  wrote:
> On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang  wrote:
>> Thanks, Andrey.
>>
>> Is the attached OSD.1 log really only these lines? I can't find any
>> detailed info in it.
>>
>> Maybe you need to increase debug_osd to 20/20?
>>
>> On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov  wrote:
>>> Hi Haomai, all.
>>>
>>> Today, after an unexpected power failure, one of the kv stores (placed
>>> on ext4 with default mount options) refused to work. I think it may be
>>> interesting to revive it, because this is almost the first time among
>>> hundreds of power failures (and their simulations) that a data store
>>> got broken.
>>>
>>> Strace:
>>> http://xdel.ru/downloads/osd1.strace.gz
>>>
>>> Debug output with 20-everything level:
>>> http://xdel.ru/downloads/osd1.out
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
> Unfortunately that's all I've got. Updated osd1.out to show the actual
> CLI args and the entire output - it ends abruptly without a final
> newline and without any valuable output.



-- 
Best Regards,

Wheat


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-29 Thread Andrey Korolyov
On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang  wrote:
> Thanks, Andrey.
>
> Is the attached OSD.1 log really only these lines? I can't find any
> detailed info in it.
>
> Maybe you need to increase debug_osd to 20/20?
>
> On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov  wrote:
>> Hi Haomai, all.
>>
>> Today, after an unexpected power failure, one of the kv stores (placed
>> on ext4 with default mount options) refused to work. I think it may be
>> interesting to revive it, because this is almost the first time among
>> hundreds of power failures (and their simulations) that a data store
>> got broken.
>>
>> Strace:
>> http://xdel.ru/downloads/osd1.strace.gz
>>
>> Debug output with 20-everything level:
>> http://xdel.ru/downloads/osd1.out
>
>
>
> --
> Best Regards,
>
> Wheat


Unfortunately that's all I've got. Updated osd1.out to show the actual
CLI args and the entire output - it ends abruptly without a final
newline and without any valuable output.


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-29 Thread Haomai Wang
Thanks, Andrey.

Is the attached OSD.1 log really only these lines? I can't find any
detailed info in it.

Maybe you need to increase debug_osd to 20/20?

On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov  wrote:
> Hi Haomai, all.
>
> Today, after an unexpected power failure, one of the kv stores (placed
> on ext4 with default mount options) refused to work. I think it may be
> interesting to revive it, because this is almost the first time among
> hundreds of power failures (and their simulations) that a data store
> got broken.
>
> Strace:
> http://xdel.ru/downloads/osd1.strace.gz
>
> Debug output with 20-everything level:
> http://xdel.ru/downloads/osd1.out



-- 
Best Regards,

Wheat


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-29 Thread Andrey Korolyov
Hi Haomai, all.

Today, after an unexpected power failure, one of the kv stores (placed
on ext4 with default mount options) refused to work. I think it may be
interesting to revive it, because this is almost the first time among
hundreds of power failures (and their simulations) that a data store
got broken.

Strace:
http://xdel.ru/downloads/osd1.strace.gz

Debug output with 20-everything level:
http://xdel.ru/downloads/osd1.out
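
If anyone wants to reproduce such a trace, something along these lines
should do it (flags and paths are illustrative, not the exact command
that was used):

strace -f -tt -o /tmp/osd1.strace ceph-osd -i 1 -c /etc/ceph/ceph.conf -f
gzip /tmp/osd1.strace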


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-26 Thread 廖建锋

I reported that problem a couple of weeks ago

From: ceph-users <ceph-users-boun...@lists.ceph.com>
Date: 2014-10-26 17:46
To: Haomai Wang <haomaiw...@gmail.com>
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

On Sun, Oct 26, 2014 at 7:40 AM, Haomai Wang  wrote:
> On Sun, Oct 26, 2014 at 3:12 AM, Andrey Korolyov  wrote:
>> Thanks, Haomai. It turns out that master's recovery is too buggy right
>> now (recovery speed degrades over time, non-kv OSDs drop out of the
>> cluster for no reason, the misplaced-object calculation is wrong, and so
>> on), so I am sticking with giant plus rocksdb for now. So far no major
>> problems have been revealed.
>
> Hmm, do you mean the kvstore has a problem with OSD recovery? I'm eager
> to know the steps to reproduce this situation. Could you give more
> detail?
>
>
>
> --
> Best Regards,
>
> Wheat


I'm not sure if kv triggered any of those; it's just a side effect of
deploying the master branch (and the OSDs that showed problems were not
only in the kv subset). It looks like both giant and master expose some
problem with pg recalculation under tight-IO conditions for the MONs
(each MON shares a disk with one of the OSDs, and post-peering
recalculation may take some minutes when kv-based OSDs are involved;
recalculation from active+remapped -> active+degraded(+...) also takes
tens of minutes. The same 'non-optimal' setup worked well before, with
all recalculations done in a matter of tens of seconds, so I will
investigate this a bit later). Giant crashed on non-KV daemons during
nightly recovery, so there is more critical stuff to fix right now,
because kv so far has not exposed any crashes by itself.


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-26 Thread Andrey Korolyov
On Sun, Oct 26, 2014 at 7:40 AM, Haomai Wang  wrote:
> On Sun, Oct 26, 2014 at 3:12 AM, Andrey Korolyov  wrote:
>> Thanks, Haomai. It turns out that master's recovery is too buggy right
>> now (recovery speed degrades over time, non-kv OSDs drop out of the
>> cluster for no reason, the misplaced-object calculation is wrong, and so
>> on), so I am sticking with giant plus rocksdb for now. So far no major
>> problems have been revealed.
>
> Hmm, do you mean the kvstore has a problem with OSD recovery? I'm eager
> to know the steps to reproduce this situation. Could you give more
> detail?
>
>
>
> --
> Best Regards,
>
> Wheat


I'm not sure if kv triggered any of those; it's just a side effect of
deploying the master branch (and the OSDs that showed problems were not
only in the kv subset). It looks like both giant and master expose some
problem with pg recalculation under tight-IO conditions for the MONs
(each MON shares a disk with one of the OSDs, and post-peering
recalculation may take some minutes when kv-based OSDs are involved;
recalculation from active+remapped -> active+degraded(+...) also takes
tens of minutes. The same 'non-optimal' setup worked well before, with
all recalculations done in a matter of tens of seconds, so I will
investigate this a bit later). Giant crashed on non-KV daemons during
nightly recovery, so there is more critical stuff to fix right now,
because kv so far has not exposed any crashes by itself.
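
Nothing exotic is needed to watch those transitions - the standard CLI
is enough; for reference:

ceph -s
ceph health detail
ceph pg dump_stuck unclean
ceph osd perf    # commit/apply latency, relevant while MON and OSD share a disk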


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-25 Thread Haomai Wang
On Sun, Oct 26, 2014 at 3:12 AM, Andrey Korolyov  wrote:
> Thanks, Haomai. It turns out that master's recovery is too buggy right
> now (recovery speed degrades over time, non-kv OSDs drop out of the
> cluster for no reason, the misplaced-object calculation is wrong, and so
> on), so I am sticking with giant plus rocksdb for now. So far no major
> problems have been revealed.

Hmm, do you mean the kvstore has a problem with OSD recovery? I'm eager
to know the steps to reproduce this situation. Could you give more
detail?



-- 
Best Regards,

Wheat


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-25 Thread Andrey Korolyov
Thanks, Haomai. It turns out that master's recovery is too buggy right
now (recovery speed degrades over time, non-kv OSDs drop out of the
cluster for no reason, the misplaced-object calculation is wrong, and so
on), so I am sticking with giant plus rocksdb for now. So far no major
problems have been revealed.
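
For reference, switching an OSD to the kv backend with rocksdb boils
down to something like the following in ceph.conf (option names from
memory for the firefly/giant era - please double-check the exact
spelling for your build, they changed between releases):

[osd]
    osd objectstore = keyvaluestore-dev
    keyvaluestore backend = rocksdb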


Re: [ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-24 Thread Haomai Wang
The kvstore is not stable in Firefly. But on the master branch there
should be no existing/known bugs.

On Fri, Oct 24, 2014 at 7:41 PM, Andrey Korolyov  wrote:
> Hi,
>
> During recovery testing on the latest firefly with the leveldb backend
> we found that the OSDs on a given host may all crash at once, leaving
> the attached backtrace. Otherwise, recovery goes more or less smoothly
> for hours.
>
> The core-file timestamps show how the issue is correlated between
> different processes on the same node:
>
> core.ceph-osd.25426.node01.1414148261
> core.ceph-osd.25734.node01.1414148263
> core.ceph-osd.25566.node01.1414148345
>
> The question is about the state of the kv backend in Firefly: is it
> considered stable enough to run production tests against, or should we
> rather move to giant/master for this?
>
> Thanks!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat


[ceph-users] Continuous OSD crash with kv backend (firefly)

2014-10-24 Thread Andrey Korolyov
Hi,

During recovery testing on the latest firefly with the leveldb backend
we found that the OSDs on a given host may all crash at once, leaving
the attached backtrace. Otherwise, recovery goes more or less smoothly
for hours.

The core-file timestamps show how the issue is correlated between
different processes on the same node:

core.ceph-osd.25426.node01.1414148261
core.ceph-osd.25734.node01.1414148263
core.ceph-osd.25566.node01.1414148345

The question is about the state of the kv backend in Firefly: is it
considered stable enough to run production tests against, or should we
rather move to giant/master for this?

Thanks!
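
The same full backtraces can be pulled from the cores with plain gdb
(assuming the ceph debug symbols are installed; the output file name is
just an example):

gdb -batch -ex 'thread apply all bt' /usr/bin/ceph-osd core.ceph-osd.25426.node01.1414148261 > osd.25426.bt.txt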
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /usr/bin/ceph-osd...Reading symbols from 
/usr/lib/debug/usr/bin/ceph-osd...done.
done.
[New LWP 10182]
[New LWP 10183]
[New LWP 10699]
[New LWP 10184]
[New LWP 10703]
[New LWP 10704]
[New LWP 10702]
[New LWP 10708]
[New LWP 10707]
[New LWP 10710]
[New LWP 10700]
[New LWP 10717]
[New LWP 10765]
[New LWP 10705]
[New LWP 10706]
[New LWP 10701]
[New LWP 10712]
[New LWP 10735]
[New LWP 10713]
[New LWP 10750]
[New LWP 10718]
[New LWP 10711]
[New LWP 10716]
[New LWP 10715]
[New LWP 10785]
[New LWP 10766]
[New LWP 10796]
[New LWP 10720]
[New LWP 10725]
[New LWP 10736]
[New LWP 10709]
[New LWP 10730]
[New LWP 11541]
[New LWP 10770]
[New LWP 11573]
[New LWP 10778]
[New LWP 10804]
[New LWP 11561]
[New LWP 9388]
[New LWP 9398]
[New LWP 11538]
[New LWP 10790]
[New LWP 11586]
[New LWP 10798]
[New LWP 9910]
[New LWP 10726]
[New LWP 21823]
[New LWP 10815]
[New LWP 9397]
[New LWP 11248]
[New LWP 10723]
[New LWP 11253]
[New LWP 10728]
[New LWP 10791]
[New LWP 9389]
[New LWP 10724]
[New LWP 10780]
[New LWP 11287]
[New LWP 11592]
[New LWP 10816]
[New LWP 10812]
[New LWP 10787]
[New LWP 20622]
[New LWP 21822]
[New LWP 10751]
[New LWP 10768]
[New LWP 10767]
[New LWP 11874]
[New LWP 10733]
[New LWP 10811]
[New LWP 11574]
[New LWP 11873]
[New LWP 10771]
[New LWP 11551]
[New LWP 10799]
[New LWP 10729]
[New LWP 18254]
[New LWP 10792]
[New LWP 10803]
[New LWP 9912]
[New LWP 11293]
[New LWP 20623]
[New LWP 14805]
[New LWP 10773]
[New LWP 11298]
[New LWP 11872]
[New LWP 10763]
[New LWP 10783]
[New LWP 10769]
[New LWP 11300]
[New LWP 10777]
[New LWP 10764]
[New LWP 10802]
[New LWP 10749]
[New LWP 14806]
[New LWP 10806]
[New LWP 10805]
[New LWP 18255]
[New LWP 10181]
[New LWP 11277]
[New LWP 9913]
[New LWP 10800]
[New LWP 10801]
[New LWP 11555]
[New LWP 11871]
[New LWP 10748]
[New LWP 9915]
[New LWP 10779]
[New LWP 11294]
[New LWP 9916]
[New LWP 10757]
[New LWP 10734]
[New LWP 10786]
[New LWP 10727]
[New LWP 19063]
[New LWP 11279]
[New LWP 9905]
[New LWP 9911]
[New LWP 10772]
[New LWP 10722]
[New LWP 9914]
[New LWP 10789]
[New LWP 11540]
[New LWP 9917]
[New LWP 11289]
[New LWP 10714]
[New LWP 10721]
[New LWP 10719]
[New LWP 10788]
[New LWP 10782]
[New LWP 10784]
[New LWP 10776]
[New LWP 10774]
[New LWP 10737]
[New LWP 19064]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/ceph-osd -i 1 --pid-file 
/var/run/ceph/osd.1.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
#0  0x7ff9ad91eb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) 
Thread 135 (Thread 0x7ff99a492700 (LWP 19064)):
#0  0x7ff9ad91ad84 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00c496da in Wait (mutex=..., this=0x108cd110) at 
./common/Cond.h:55
#2  Pipe::writer (this=0x108ccf00) at msg/Pipe.cc:1730
#3  0x00c5485d in Pipe::Writer::entry (this=) at 
msg/Pipe.h:61
#4  0x7ff9ad916e9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x7ff9ac4a43dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x in ?? ()

Thread 134 (Thread 0x7ff975e1d700 (LWP 10737)):
#0  0x7ff9ac498a13 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00c3e73c in Pipe::tcp_read_wait (this=this@entry=0x4a53180) at 
msg/Pipe.cc:2282
#2  0x00c3e9d0 in Pipe::tcp_read (this=this@entry=0x4a53180, 
buf=, buf@entry=0x7ff975e1cccf "\377", len=len@entry=1)
at msg/Pipe.cc:2255
#3  0x00c5095f in Pipe::reader (this=0x4a53180) at msg/Pipe.cc:1421
#4  0x00c5497d in Pipe::Reader::entry (this=) at 
msg/Pipe.h:49
#5  0x7ff9ad916e9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#6  0x7ff9ac4a43dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x in ?? ()

Thread 133 (Thread 0x7ff972dda700 (LWP 10774)):
#0  0x7ff9ac49