Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-08 Thread Jan Pekař - Imatic

You were right: the freeze was at the virtual machine level.
The panic kernel parameter worked, so the server recovered with a reboot.

However, no panic was displayed on the VNC console even though I was logged in.

The main problem is that a simultaneous silent failure of a MON and an OSD 
makes recovery from that state take much longer.


In my case

approx. at 18:38:11 I paused the MON+OSD
at 18:38:17 the first "heartbeat_check: no reply from ..." appears
at 18:38:30 "libceph: mon1 [X]:6789 session lost, hunting for new mon"
at 18:38:30 "libceph: mon2 [X]:6789 session established"
at 18:39:05 imatic-hydra01 kernel: [2384345.121219] libceph: osd6 down

So in my case it took 54 seconds to resume IO and recover. Is that 
normal and expected?
I think the delay is that long because the MON hunt ran during the OSD 
failure and another monitor won the election, so the timeouts for kicking 
the OSD out only start counting from the very beginning after that.


When considering timeouts, everybody must count on MON recovery timeout 
+ OSD recovery timeout as the worst-case IO outage. Even though they are 
hosted on different machines, they can fail at the same time.
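
For reference, the effective values of the timers involved can be read back 
from the running daemons through their admin sockets. A minimal sketch, 
assuming daemon ids mon.a and osd.6 (substitute your own) and the standard 
Luminous option names; each command has to run on the host where that 
daemon lives:

# how often a librados client hunts for a new mon after the session is lost
# (the kernel libceph client has its own, separate timers)
ceph daemon mon.a config get mon_client_hunt_interval
# monitor lease; failure reports are not processed until a new leader holds quorum
ceph daemon mon.a config get mon_lease
# how long OSD peers wait for heartbeat replies before reporting a silent OSD
ceph daemon osd.6 config get osd_heartbeat_grace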


Do you have any recommendations for reliable heartbeat and other settings 
so that virtual machines with ext4, XFS and NTFS stay safe?


Thank you
With regards
Jan Pekar




On 7.11.2017 00:30, Jason Dillaman wrote:
If you could install the debug packages and get a gdb backtrace from all 
threads it would be helpful. librbd doesn't utilize any QEMU threads so 
even if librbd was deadlocked, the worst case that I would expect would 
be your guest OS complaining about hung kernel tasks related to disk IO 
(since the disk wouldn't be responding).


On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic > wrote:


Hi,

I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
1:2.8+dfsg-6+deb9u3
I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.

When I tested the cluster, I detected a strange and severe problem.
On the first node I'm running qemu hosts with a librados disk connection
to the cluster, with all 3 monitors listed in the connection.
On second node I stopped mon and osd with command

kill -STOP MONPID OSDPID

Within one minute all my qemu hosts on the first node freeze, so they
don't even respond to ping. On the VNC screen there is no error (disk or
kernel panic); they just hang forever with no console response. Even
starting the MON and OSD on the stopped host doesn't bring them back.
Destroying the qemu domain and starting it again is the only solution.

This happens even if the virtual machine has all its primary OSDs on OSDs
other than the one I have stopped - so it is not writing its primary copy
to the stopped OSD.

If I stop only the OSD and the MON keeps running, or I stop only the MON
and the OSD keeps running, everything looks OK.

When I stop the MON and the OSD, I can see in the log "osd.0 1300
heartbeat_check: no reply from ..." as usual when an OSD fails. During
this the virtuals are still running, but after that they all stop.

What should I send you to debug this problem? Without fixing this,
ceph is not reliable to me.

Thank you
With regards
Jan Pekar
Imatic
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Jason


--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic

I am using librbd.

rbd map was only a test to see whether the problem is librbd related. Both 
librbd and rbd map ended in the same frozen state.


The node running the virtuals has the 4.9.0-3-amd64 kernel.

The two tested virtuals run the
4.9.0-3-amd64 kernel and the
4.10.17-2-pve kernel, respectively.

JP

On 7.11.2017 10:42, Wido den Hollander wrote:



On 7 November 2017 at 10:14, Jan Pekař - Imatic wrote:


Additional info - it is not librbd related, I mapped disk through
rbd map and it was the same - virtuals were stuck/frozen.
I happened exactly when in my log appeared



Why aren't you using librbd? Is there a specific reason for that? With 
Qemu/KVM/libvirt I always suggest to use librbd.

And in addition, what kernel version are you running?

Wido


Nov  7 10:01:27 imatic-hydra01 kernel: [2266883.493688] libceph: osd6 down

I can attach with strace to qemu process and I can get this running in loop:

root@imatic-hydra01:/usr/local/libvirt/bin# strace -p 31963
strace: Process 31963 attached
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46,
events=POLLIN}], 6, {tv_sec=0, tv_nsec=355313847}, NULL, 8) = 0 (Timeout)
poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10,
revents=POLLOUT|POLLHUP}])
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46,
events=POLLIN}], 6, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10,
revents=POLLOUT|POLLHUP}])
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46,
events=POLLIN}], 6, {tv_sec=0, tv_nsec=493273904}, NULL, 8) = 0 (Timeout)
Process 31963 detached
   

Can you please give me brief info, what should I debug and how can I do
that? I'm newbie in gdb debugging.
It is not problem inside the virtual machine (like disk not responding)
because I can't even get to VNC console and there is no kernel panic
visible on it. Also I suppose kernel should ping without disk being
available.

Thank you

With regards
Jan Pekar



On 7.11.2017 00:30, Jason Dillaman wrote:

If you could install the debug packages and get a gdb backtrace from all
threads it would be helpful. librbd doesn't utilize any QEMU threads so
even if librbd was deadlocked, the worst case that I would expect would
be your guest OS complaining about hung kernel tasks related to disk IO
(since the disk wouldn't be responding).

On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic > wrote:

 Hi,

 I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
 1:2.8+dfsg-6+deb9u3
 I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.

 When I tested the cluster, I detected strange and severe problem.
 On first node I'm running qemu hosts with librados disk connection
 to the cluster and all 3 monitors mentioned in connection.
 On second node I stopped mon and osd with command

 kill -STOP MONPID OSDPID

 Within one minute all my qemu hosts on first node freeze, so they
 even don't respond to ping. On VNC screen there is no error (disk or
 kernel panic), they just hung forever with no console response. Even
 starting MON and OSD on stopped host doesn't make them running.
 Destroying the qemu domain and running again is the only solution.

 This happens even if virtual machine has all primary OSD on other
 OSDs from that I have stopped - so it is not writing primary to the
 stopped OSD.

 If I stop only OSD and MON keep running, or I stop only MON and OSD
 keep running everything looks OK.

 When I stop MON and OSD, I can see in log  osd.0 1300
 heartbeat_check: no reply from ... as usual when OSD fails. During
 this are virtuals still running, but after that they all stop.

 What should I send you to debug this problem? Without fixing that,
 ceph is not reliable to me.

 Thank you
 With regards
 Jan Pekar
 Imatic
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 




--
Jason


--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic

I migrated a virtual to my second node, which is running
qemu-kvm version 1:2.1+dfsg-12+deb8u6 (from Debian oldstable):
the same situation, frozen after approx. 30-40 seconds, at the moment
"libceph: osd6 down" appeared in syslog (not before).
My other virtual on the first node froze at the same time.
Both virtuals are running Debian stretch, one with the
4.9.0-3-amd64 kernel, the second with the
4.10.17-2-pve kernel.

I cannot test Windows virtuals right now.

One of my virtuals is on a pool where I forced the primary OSDs onto nodes 
(OSDs) other than the one I'm stopping, and the pool has min_size 1, so I 
assume (while the primary OSD is still online and available) I shouldn't 
have any issue with disk writes or reads. But that virtual is also affected 
and doesn't survive stopping the MON+OSD.
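
For reference, a sketch of how primaries can be steered away from a node and 
min_size lowered, assuming a hypothetical pool name rbdtest and that osd.6 is 
on the node being stopped (primary affinity only changes primary selection, 
it does not move data; older releases may require 
mon_osd_allow_primary_affinity = true):

# never choose osd.6 as a primary
ceph osd primary-affinity osd.6 0
# allow IO to continue with a single surviving replica
ceph osd pool set rbdtest min_size 1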


I tried to set

[global]
heartbeat interval = 5
[osd]
osd heartbeat interval = 3
osd heartbeat grace = 10

in my ceph.conf

and after my test there was no "heartbeat_check: no reply from" in syslog, 
just "libceph: osd6 down", and the virtuals survived it.
That can be a workaround for me, but it may also just be a coincidence that 
another part of the mon code marked the osd down before my problem occurred. 
I also assume that everybody else is using the default heartbeat settings.
My cluster was installed on luminous (not migrated from previous 
versions) and the node OS is stretch (one node is lenny).
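
In case it helps anyone reproduce this, the same values can also be pushed to 
running OSDs without a restart; a sketch using injectargs (the injected values 
are not persistent across daemon restarts, so keep them in ceph.conf as well):

# inject the tighter heartbeat settings into all running OSDs
ceph tell osd.* injectargs '--osd_heartbeat_interval 3 --osd_heartbeat_grace 10'
# verify on one OSD (run on the host where osd.0 lives)
ceph daemon osd.0 config get osd_heartbeat_grace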


With regards
Jan Pekar
Imatic


On 7.11.2017 14:16, Jason Dillaman wrote:
If you are seeing this w/ librbd and krbd, I would suggest trying a 
different version of QEMU and/or different host OS since loss of a disk 
shouldn't hang it -- only potentially the guest OS.


On Tue, Nov 7, 2017 at 5:17 AM, Jan Pekař - Imatic > wrote:


I'm calling kill -STOP to simulate behavior, that occurred, when on
one ceph node i was out of memory. Processes was not killed, but
were somehow suspended/unresponsible (they couldn't create new
threads etc), and that caused all virtuals (on other nodes) to hung.
I decided to simulate it with kill -STOP MONPID OSDPID and I succeeded.

After I stop MON with OSD, it took few seconds to get osd
unresponsive messages, and exactly when I get final
libceph: osd6 down
all my virtuals stops responding (stop pinging, unable to use VNC etc)
Tried with librdb disk definition or rbd map device attached inside
QEMU/KVM virtuals.

JP


On 7.11.2017 10:57, Piotr Dałek wrote:

On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:

Hi,

I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
1:2.8+dfsg-6+deb9u3
I'm running 3 nodes with 3 monitors and 8 osds on my nodes,
all on IPV6.

When I tested the cluster, I detected strange and severe
problem.
On first node I'm running qemu hosts with librados disk
connection to the cluster and all 3 monitors mentioned in
connection.
On second node I stopped mon and osd with command

kill -STOP MONPID OSDPID

Within one minute all my qemu hosts on first node freeze, so
they even don't respond to ping. [..]


Why would you want to *stop* (as in, freeze) a process instead
of killing it?
Anyway, with processes still there, it may take a few minutes
before cluster realizes that daemons are stopped and kicks it
out of cluster, restoring normal behavior (assuming correctly
set crush rules).


-- 


Ing. Jan Pekař
jan.pe...@imatic.cz  | +420603811737


Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Jason


--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jason Dillaman
If you are seeing this w/ librbd and krbd, I would suggest trying a
different version of QEMU and/or different host OS since loss of a disk
shouldn't hang it -- only potentially the guest OS.

On Tue, Nov 7, 2017 at 5:17 AM, Jan Pekař - Imatic 
wrote:

> I'm calling kill -STOP to simulate behavior, that occurred, when on one
> ceph node i was out of memory. Processes was not killed, but were somehow
> suspended/unresponsible (they couldn't create new threads etc), and that
> caused all virtuals (on other nodes) to hung.
> I decided to simulate it with kill -STOP MONPID OSDPID and I succeeded.
>
> After I stop MON with OSD, it took few seconds to get osd unresponsive
> messages, and exactly when I get final
> libceph: osd6 down
> all my virtuals stops responding (stop pinging, unable to use VNC etc)
> Tried with librdb disk definition or rbd map device attached inside
> QEMU/KVM virtuals.
>
> JP
>
>
> On 7.11.2017 10:57, Piotr Dałek wrote:
>
>> On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:
>>
>>> Hi,
>>>
>>> I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
>>> 1:2.8+dfsg-6+deb9u3
>>> I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.
>>>
>>> When I tested the cluster, I detected strange and severe problem.
>>> On first node I'm running qemu hosts with librados disk connection to
>>> the cluster and all 3 monitors mentioned in connection.
>>> On second node I stopped mon and osd with command
>>>
>>> kill -STOP MONPID OSDPID
>>>
>>> Within one minute all my qemu hosts on first node freeze, so they even
>>> don't respond to ping. [..]
>>>
>>
>> Why would you want to *stop* (as in, freeze) a process instead of killing
>> it?
>> Anyway, with processes still there, it may take a few minutes before
>> cluster realizes that daemons are stopped and kicks it out of cluster,
>> restoring normal behavior (assuming correctly set crush rules).
>>
>>
> --
> 
> Ing. Jan Pekař
> jan.pe...@imatic.cz | +420603811737
> 
> Imatic | Jagellonská 14 | Praha 3 | 130 00
> http://www.imatic.cz
> 
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic
I'm calling kill -STOP to simulate behavior that occurred when one 
ceph node ran out of memory. The processes were not killed, but were 
somehow suspended/unresponsive (they couldn't create new threads, etc.), 
and that caused all the virtuals (on other nodes) to hang.

I decided to simulate it with kill -STOP MONPID OSDPID, and that reproduced the problem.
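
For anyone who wants to reproduce it, a minimal sketch of the freeze and 
resume, assuming one ceph-mon process on the node and picking up every 
ceph-osd process there (narrow it to a single OSD pid to match my test exactly):

# freeze the mon and osd daemons on this node (alive but unresponsive, like the OOM case)
kill -STOP $(pgrep -x ceph-mon) $(pgrep -x ceph-osd)
# ... watch the client behavior, then let the daemons continue
kill -CONT $(pgrep -x ceph-mon) $(pgrep -x ceph-osd)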

After I stop the MON together with the OSD, it takes a few seconds to get 
the osd-unresponsive messages, and exactly when I get the final

libceph: osd6 down
all my virtuals stop responding (they stop pinging, VNC is unusable, etc.).
I tried both a librbd disk definition and an rbd-mapped device attached 
inside the QEMU/KVM virtuals.


JP


On 7.11.2017 10:57, Piotr Dałek wrote:

On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:

Hi,

I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu 
1:2.8+dfsg-6+deb9u3

I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.

When I tested the cluster, I detected strange and severe problem.
On first node I'm running qemu hosts with librados disk connection to 
the cluster and all 3 monitors mentioned in connection.

On second node I stopped mon and osd with command

kill -STOP MONPID OSDPID

Within one minute all my qemu hosts on first node freeze, so they even 
don't respond to ping. [..]


Why would you want to *stop* (as in, freeze) a process instead of 
killing it?
Anyway, with processes still there, it may take a few minutes before 
cluster realizes that daemons are stopped and kicks it out of cluster, 
restoring normal behavior (assuming correctly set crush rules).




--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Piotr Dałek

On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:

Hi,

I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu 
1:2.8+dfsg-6+deb9u3

I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.

When I tested the cluster, I detected strange and severe problem.
On first node I'm running qemu hosts with librados disk connection to the 
cluster and all 3 monitors mentioned in connection.

On second node I stopped mon and osd with command

kill -STOP MONPID OSDPID

Within one minute all my qemu hosts on first node freeze, so they even don't 
respond to ping. [..]


Why would you want to *stop* (as in, freeze) a process instead of killing it?
Anyway, with the processes still there, it may take a few minutes before the 
cluster realizes that the daemons are stopped and kicks them out of the cluster, 
restoring normal behavior (assuming correctly set CRUSH rules).
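
For reference, how long that takes is mostly governed by the OSD heartbeat 
and mon reporting options; a sketch of checking them, assuming standard 
Luminous option names and that the commands run on the respective daemon's host:

# how many distinct OSDs must report a peer before the mon marks it down
ceph daemon mon.a config get mon_osd_min_down_reporters
# a silent OSD that stops sending beacons is marked down after this long even without reporters
ceph daemon mon.a config get mon_osd_report_timeout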


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Wido den Hollander

> On 7 November 2017 at 10:14, Jan Pekař - Imatic wrote:
> 
> 
> Additional info - it is not librbd related, I mapped disk through
> rbd map and it was the same - virtuals were stuck/frozen.
> I happened exactly when in my log appeared
> 

Why aren't you using librbd? Is there a specific reason for that? With 
Qemu/KVM/libvirt I always suggest using librbd.

And in addition, what kernel version are you running?

Wido

> Nov  7 10:01:27 imatic-hydra01 kernel: [2266883.493688] libceph: osd6 down
> 
> I can attach with strace to qemu process and I can get this running in loop:
> 
> root@imatic-hydra01:/usr/local/libvirt/bin# strace -p 31963
> strace: Process 31963 attached
> ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
> events=POLLIN}], 6, {tv_sec=0, tv_nsec=355313847}, NULL, 8) = 0 (Timeout)
> poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10, 
> revents=POLLOUT|POLLHUP}])
> ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
> events=POLLIN}], 6, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
> poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10, 
> revents=POLLOUT|POLLHUP}])
> ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
> events=POLLIN}], 6, {tv_sec=0, tv_nsec=493273904}, NULL, 8) = 0 (Timeout)
> Process 31963 detached
>   
> 
> Can you please give me brief info, what should I debug and how can I do 
> that? I'm newbie in gdb debugging.
> It is not problem inside the virtual machine (like disk not responding) 
> because I can't even get to VNC console and there is no kernel panic 
> visible on it. Also I suppose kernel should ping without disk being 
> available.
> 
> Thank you
> 
> With regards
> Jan Pekar
> 
> 
> 
> On 7.11.2017 00:30, Jason Dillaman wrote:
> > If you could install the debug packages and get a gdb backtrace from all 
> > threads it would be helpful. librbd doesn't utilize any QEMU threads so 
> > even if librbd was deadlocked, the worst case that I would expect would 
> > be your guest OS complaining about hung kernel tasks related to disk IO 
> > (since the disk wouldn't be responding).
> > 
> > On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic  > > wrote:
> > 
> > Hi,
> > 
> > I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
> > 1:2.8+dfsg-6+deb9u3
> > I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.
> > 
> > When I tested the cluster, I detected strange and severe problem.
> > On first node I'm running qemu hosts with librados disk connection
> > to the cluster and all 3 monitors mentioned in connection.
> > On second node I stopped mon and osd with command
> > 
> > kill -STOP MONPID OSDPID
> > 
> > Within one minute all my qemu hosts on first node freeze, so they
> > even don't respond to ping. On VNC screen there is no error (disk or
> > kernel panic), they just hung forever with no console response. Even
> > starting MON and OSD on stopped host doesn't make them running.
> > Destroying the qemu domain and running again is the only solution.
> > 
> > This happens even if virtual machine has all primary OSD on other
> > OSDs from that I have stopped - so it is not writing primary to the
> > stopped OSD.
> > 
> > If I stop only OSD and MON keep running, or I stop only MON and OSD
> > keep running everything looks OK.
> > 
> > When I stop MON and OSD, I can see in log  osd.0 1300
> > heartbeat_check: no reply from ... as usual when OSD fails. During
> > this are virtuals still running, but after that they all stop.
> > 
> > What should I send you to debug this problem? Without fixing that,
> > ceph is not reliable to me.
> > 
> > Thank you
> > With regards
> > Jan Pekar
> > Imatic
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> > 
> > 
> > 
> > -- 
> > Jason
> 
> -- 
> 
> Ing. Jan Pekař
> jan.pe...@imatic.cz | +420603811737
> 
> Imatic | Jagellonská 14 | Praha 3 | 130 00
> http://www.imatic.cz
> 
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic

Additional info: it is not librbd related. I mapped the disk through
rbd map and the result was the same - the virtuals were stuck/frozen.
It happened exactly when this appeared in my log:

Nov  7 10:01:27 imatic-hydra01 kernel: [2266883.493688] libceph: osd6 down

I can attach strace to the qemu process and it keeps running this in a loop:

root@imatic-hydra01:/usr/local/libvirt/bin# strace -p 31963
strace: Process 31963 attached
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
events=POLLIN}], 6, {tv_sec=0, tv_nsec=355313847}, NULL, 8) = 0 (Timeout)
poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10, 
revents=POLLOUT|POLLHUP}])
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
events=POLLIN}], 6, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10, 
revents=POLLOUT|POLLHUP}])
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
events=POLLIN}], 6, {tv_sec=0, tv_nsec=493273904}, NULL, 8) = 0 (Timeout)

Process 31963 detached
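
Note that strace -p with a bare pid follows only the qemu main loop thread; 
to see whether the librbd worker threads are blocked or still spinning, all 
threads can be traced. A sketch, assuming the same pid as above:

# follow every thread of the qemu process (Ctrl+C to detach)
strace -f -p 31963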
 

Can you please give me a brief idea of what I should debug and how? I'm 
a newbie at gdb debugging.
It is not a problem inside the virtual machine (like the disk not responding), 
because I can't even get to the VNC console and there is no kernel panic 
visible on it. Also, I suppose the kernel should still answer ping without 
the disk being available.


Thank you

With regards
Jan Pekar



On 7.11.2017 00:30, Jason Dillaman wrote:
If you could install the debug packages and get a gdb backtrace from all 
threads it would be helpful. librbd doesn't utilize any QEMU threads so 
even if librbd was deadlocked, the worst case that I would expect would 
be your guest OS complaining about hung kernel tasks related to disk IO 
(since the disk wouldn't be responding).


On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic > wrote:


Hi,

I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
1:2.8+dfsg-6+deb9u3
I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.

When I tested the cluster, I detected strange and severe problem.
On first node I'm running qemu hosts with librados disk connection
to the cluster and all 3 monitors mentioned in connection.
On second node I stopped mon and osd with command

kill -STOP MONPID OSDPID

Within one minute all my qemu hosts on first node freeze, so they
even don't respond to ping. On VNC screen there is no error (disk or
kernel panic), they just hung forever with no console response. Even
starting MON and OSD on stopped host doesn't make them running.
Destroying the qemu domain and running again is the only solution.

This happens even if virtual machine has all primary OSD on other
OSDs from that I have stopped - so it is not writing primary to the
stopped OSD.

If I stop only OSD and MON keep running, or I stop only MON and OSD
keep running everything looks OK.

When I stop MON and OSD, I can see in log  osd.0 1300
heartbeat_check: no reply from ... as usual when OSD fails. During
this are virtuals still running, but after that they all stop.

What should I send you to debug this problem? Without fixing that,
ceph is not reliable to me.

Thank you
With regards
Jan Pekar
Imatic
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Jason


--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-06 Thread Jason Dillaman
If you could install the debug packages and get a gdb backtrace from all
threads it would be helpful. librbd doesn't utilize any QEMU threads so
even if librbd was deadlocked, the worst case that I would expect would be
your guest OS complaining about hung kernel tasks related to disk IO (since
the disk wouldn't be responding).
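
A minimal sketch of capturing such a backtrace, assuming gdb and the matching
debug symbol packages are installed (package names vary by distribution) and
that the qemu process can be found by name:

# find the qemu process and dump a backtrace of every thread, then detach
QEMU_PID=$(pgrep -f qemu-system | head -n 1)
gdb --batch -ex 'thread apply all bt' -p "$QEMU_PID" > qemu-backtrace.txt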

On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic 
wrote:

> Hi,
>
> I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
> 1:2.8+dfsg-6+deb9u3
> I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.
>
> When I tested the cluster, I detected strange and severe problem.
> On first node I'm running qemu hosts with librados disk connection to the
> cluster and all 3 monitors mentioned in connection.
> On second node I stopped mon and osd with command
>
> kill -STOP MONPID OSDPID
>
> Within one minute all my qemu hosts on first node freeze, so they even
> don't respond to ping. On VNC screen there is no error (disk or kernel
> panic), they just hung forever with no console response. Even starting MON
> and OSD on stopped host doesn't make them running. Destroying the qemu
> domain and running again is the only solution.
>
> This happens even if virtual machine has all primary OSD on other OSDs
> from that I have stopped - so it is not writing primary to the stopped OSD.
>
> If I stop only OSD and MON keep running, or I stop only MON and OSD keep
> running everything looks OK.
>
> When I stop MON and OSD, I can see in log  osd.0 1300 heartbeat_check: no
> reply from ... as usual when OSD fails. During this are virtuals still
> running, but after that they all stop.
>
> What should I send you to debug this problem? Without fixing that, ceph is
> not reliable to me.
>
> Thank you
> With regards
> Jan Pekar
> Imatic
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com