Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

Voloshanenko Igor Tue, 03 Nov 2015 01:03:13 -0800

Wido, it's the main issue. No records at all...


So, from last time:


2015-11-02 11:40:33,204 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk
'{print $2}'
2015-11-02 11:40:33,207 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Execution is successful.
2015-11-02 11:40:35,316 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Processing command:
com.cloud.agent.api.GetVmStatsCommand
2015-11-02 11:40:35,867 INFO  [cloud.agent.AgentShell] (main:null) Agent
started
2015-11-02 11:40:35,868 INFO  [cloud.agent.AgentShell] (main:null)
Implementation Version is 4.5.1

So, almost alsways it's exception after RbdUnprotect then in approx . 20
minutes - crash..
Almost all the time - it's happen after GetVmStatsCommand or Disks stats...
Possible that evil hiden into UpadteDiskInfo method... but i can;t find any
bad code there (((

2015-11-03 10:40 GMT+02:00 Wido den Hollander <w...@42on.com>:

>
>
> On 03-11-15 01:54, Voloshanenko Igor wrote:
> > Thank you, Jason!
> >
> > Any advice, for troubleshooting....
> >
> > I'm looking in code, and right now don;t see any bad things :(
> >
>
> Can you run the CloudStack Agent in DEBUG mode and then see after which
> lines in the logs it crashes?
>
> Wido
>
> > 2015-11-03 1:32 GMT+02:00 Jason Dillaman <dilla...@redhat.com
> > <mailto:dilla...@redhat.com>>:
> >
> >     Most likely not going to be related to 13045 since you aren't
> >     actively exporting an image diff.  The most likely problem is that
> >     the RADOS IO context is being closed prior to closing the RBD image.
> >
> >     --
> >
> >     Jason Dillaman
> >
> >
> >     ----- Original Message -----
> >
> >     > From: "Voloshanenko Igor" <igor.voloshane...@gmail.com
> >     <mailto:igor.voloshane...@gmail.com>>
> >     > To: "Ceph Users" <ceph-users@lists.ceph.com
> >     <mailto:ceph-users@lists.ceph.com>>
> >     > Sent: Thursday, October 29, 2015 5:27:17 PM
> >     > Subject: Re: [ceph-users] Cloudstack agent crashed JVM with
> >     exception in
> >     > librbd
> >
> >     > From all we analyzed - look like - it's this issue
> >     > http://tracker.ceph.com/issues/13045
> >
> >     > PR: https://github.com/ceph/ceph/pull/6097
> >
> >     > Can anyone help us to confirm this? :)
> >
> >     > 2015-10-29 23:13 GMT+02:00 Voloshanenko Igor <
> >     igor.voloshane...@gmail.com <mailto:igor.voloshane...@gmail.com> >
> >     > :
> >
> >     > > Additional trace:
> >     >
> >
> >     > > #0 0x00007f30f9891cc9 in __GI_raise (sig=sig@entry=6) at
> >     > > ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> >     >
> >     > > #1 0x00007f30f98950d8 in __GI_abort () at abort.c:89
> >     >
> >     > > #2 0x00007f30f87b36b5 in
> >     __gnu_cxx::__verbose_terminate_handler() () from
> >     > > /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> >     >
> >     > > #3 0x00007f30f87b1836 in ?? () from
> >     > > /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> >     >
> >     > > #4 0x00007f30f87b1863 in std::terminate() () from
> >     > > /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> >     >
> >     > > #5 0x00007f30f87b1aa2 in __cxa_throw () from
> >     > > /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> >     >
> >     > > #6 0x00007f2fddb50778 in ceph::__ceph_assert_fail
> >     > > (assertion=assertion@entry=0x7f2fdddeca05 "sub <
> m_subsys.size()",
> >     >
> >     > > file=file@entry=0x7f2fdddec9f0 "./log/SubsystemMap.h",
> >     line=line@entry=62,
> >     >
> >     > > func=func@entry=0x7f2fdddedba0
> >     > >
> >
>  <_ZZN4ceph3log12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool
> >     > > ceph::log::SubsystemMap::should_gather(unsigned int, int)") at
> >     > > common/assert.cc:77
> >     >
> >     > > #7 0x00007f2fdda1fed2 in ceph::log::SubsystemMap::should_gather
> >     > > (level=<optimized out>, sub=<optimized out>, this=<optimized
> out>)
> >     >
> >     > > at ./log/SubsystemMap.h:62
> >     >
> >     > > #8 0x00007f2fdda3b693 in ceph::log::SubsystemMap::should_gather
> >     > > (this=<optimized out>, sub=<optimized out>, level=<optimized
> out>)
> >     >
> >     > > at ./log/SubsystemMap.h:61
> >     >
> >     > > #9 0x00007f2fddd879be in ObjectCacher::flusher_entry
> >     (this=0x7f2ff80b27a0)
> >     > > at
> >     > > osdc/ObjectCacher.cc:1527
> >     >
> >     > > #10 0x00007f2fddd9851d in ObjectCacher::FlusherThread::entry
> >     > > (this=<optimized
> >     > > out>) at osdc/ObjectCacher.h:374
> >     >
> >     > > #11 0x00007f30f9c28182 in start_thread (arg=0x7f2e1a7fc700) at
> >     > > pthread_create.c:312
> >     >
> >     > > #12 0x00007f30f995547d in clone () at
> >     > > ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> >     >
> >
> >     > > 2015-10-29 17:38 GMT+02:00 Voloshanenko Igor <
> >     igor.voloshane...@gmail.com <mailto:igor.voloshane...@gmail.com>
> >     > > >
> >     > > :
> >     >
> >
> >     > > > Hi Wido and all community.
> >     > >
> >     >
> >
> >     > > > We catched very idiotic issue on our Cloudstack installation,
> >     which
> >     > > > related
> >     > > > to ceph and possible to java-rados lib.
> >     > >
> >     >
> >
> >     > > > So, we have constantly agent crashed (which cause very big
> >     problem for
> >     > > > us...
> >     > > > ).
> >     > >
> >     >
> >
> >     > > > When agent crashed - it's crash JVM. And no event in logs at
> all.
> >     > >
> >     >
> >     > > > We enabled crush dump, and after crash we see next picture:
> >     > >
> >     >
> >
> >     > > > #grep -A1 "Problematic frame" < /hs_err_pid30260.log
> >     > >
> >     >
> >     > > > Problematic frame:
> >     > >
> >     >
> >     > > > C [librbd.so.1.0.0+0x5d681]
> >     > >
> >     >
> >
> >     > > > # gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
> >     > >
> >     >
> >     > > > (gdb) bt
> >     > >
> >     >
> >     > > > ...
> >     > >
> >     >
> >     > > > #7 0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather
> >     > > > (level=<optimized out>, sub=<optimized out>, this=<optimized
> out>)
> >     > >
> >     >
> >     > > > at ./log/SubsystemMap.h:62
> >     > >
> >     >
> >     > > > #8 0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather
> >     > > > (this=<optimized out>, sub=<optimized out>, level=<optimized
> out>)
> >     > >
> >     >
> >     > > > at ./log/SubsystemMap.h:61
> >     > >
> >     >
> >     > > > #9 0x00007f30b9d879be in ObjectCacher::flusher_entry
> >     > > > (this=0x7f2fb4017910)
> >     > > > at
> >     > > > osdc/ObjectCacher.cc:1527
> >     > >
> >     >
> >     > > > #10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry
> >     > > > (this=<optimized
> >     > > > out>) at osdc/ObjectCacher.h:374
> >     > >
> >     >
> >
> >     > > > From ceph code, this part executed when flushing cache
> >     object... And we
> >     > > > don;t
> >     > > > understand why. Becasue we have absolutely different race
> >     condition to
> >     > > > reproduce it.
> >     > >
> >     >
> >
> >     > > > As cloudstack have not good implementation yet of snapshot
> >     lifecycle,
> >     > > > sometime, it's happen, that some volumes already marked as
> >     EXPUNGED in DB
> >     > > > and then cloudstack try to delete bas Volume, before it's try
> to
> >     > > > unprotect
> >     > > > it.
> >     > >
> >     >
> >
> >     > > > Sure, unprotecting fail, normal exception returned back (fail
> >     because
> >     > > > snap
> >     > > > has childs... )
> >     > >
> >     >
> >
> >     > > > 2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor]
> >     > > > (Thread-1304:null)
> >     > > > Executing:
> >     > > >
> >
>  /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
> >     -i
> >     > > > 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m
> >     > > > /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor]
> >     > > > (Thread-1304:null)
> >     > > > Execution is successful.
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:19,554 INFO
> [kvm.storage.LibvirtStorageAdaptor]
> >     > > > (agentRequest-Handler-5:null) Unprotecting and Removing RBD
> >     snapshots of
> >     > > > image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to
> >     removing the
> >     > > > image
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:19,571 DEBUG
> [kvm.storage.LibvirtStorageAdaptor]
> >     > > > (agentRequest-Handler-5:null) Succesfully connected to Ceph
> >     cluster at
> >     > > > cephmon.anolim.net:6789 <http://cephmon.anolim.net:6789>
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:19,608 DEBUG
> [kvm.storage.LibvirtStorageAdaptor]
> >     > > > (agentRequest-Handler-5:null) Unprotecting snapshot
> >     > > >
> >     cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor]
> >     > > > (agentRequest-Handler-5:null) Failed to delete volume:
> >     > > > com.cloud.utils.exception.CloudRuntimeException:
> >     > > > com.ceph.rbd.RbdException:
> >     > > > Failed to unprotect snapshot cloudstack-base-snap
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent]
> >     > > > (agentRequest-Handler-5:null) Seq 4-1921583831: { Ans: ,
> MgmtId:
> >     > > > 161344838950, via: 4, Ver: v1, Flags: 10,
> >     > > >
> >
>  
> [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
> >     > > > com.ceph.rbd.RbdException: Failed to unprotect snapshot
> >     > > > cloudstack-base-snap","wait":0}}] }
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent]
> >     > > > (agentRequest-Handler-2:null) Processing command:
> >     > > > com.cloud.agent.api.GetHostStatsCommand
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:25,722 DEBUG
> >     [kvm.resource.LibvirtComputingResource]
> >     > > > (agentRequest-Handler-2:null) Executing: /bin/bash -c
> >     idle=$(top -b -n 1|
> >     > > > awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }');
> >     echo $idle
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:26,249 DEBUG
> >     [kvm.resource.LibvirtComputingResource]
> >     > > > (agentRequest-Handler-2:null) Execution is successful.
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:26,250 DEBUG
> >     [kvm.resource.LibvirtComputingResource]
> >     > > > (agentRequest-Handler-2:null) Executing: /bin/bash -c
> >     freeMem=$(free|grep
> >     > > > cache:|awk '{print $4}');echo $freeMem
> >     > >
> >     >
> >     > > > 2015-10-29 09:02:26,254 DEBUG
> >     [kvm.resource.LibvirtComputingResource]
> >     > > > (agentRequest-Handler-2:null) Execution is successful.
> >     > >
> >     >
> >
> >     > > > BUT, after 20 minutes - agent crashed... If we remove all
> >     childs and
> >     > > > create
> >     > > > conditions for cloudstack to delete volume - all OK - no agent
> >     crash in
> >     > > > 20
> >     > > > minutes...
> >     > >
> >     >
> >
> >     > > > We can't connect this action - Volume delete with agent
> >     crashe... Also we
> >     > > > don't understand why +- 20 minutes need to last, and only then
> >     agent
> >     > > > crashed...
> >     > >
> >     >
> >
> >     > > > From logs, before crash - only GetVMStats... And then - agent
> >     started...
> >     > >
> >     >
> >
> >     > > > 2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent]
> >     (UgentTask-5:null)
> >     > > > Sending
> >     > > > ping: Seq 4-1343: { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags:
> 11,
> >     > > >
> >
>  
> [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"
> >     > > > cs2.anolim.net <http://cs2.anolim.net>
> >     "},"i-7-106-VM":{"state":"PowerOn","host":" cs2.anolim.net
> >     <http://cs2.anolim.net>
> >     > > > "},"i-1683-1984-VM":{"state":"PowerOn","host":" cs2.anolim.net
> >     <http://cs2.anolim.net>
> >     > > > "},"i-11-504-VM":{"state":"PowerOn","host":" cs2.anolim.net
> >     <http://cs2.anolim.net>
> >     > > > "},"i-325-616-VM":{"state":"PowerOn","host":" cs2.anolim.net
> >     <http://cs2.anolim.net>
> >     > > > "},"i-10-52-VM":{"state":"PowerOn","host":" cs2.anolim.net
> >     <http://cs2.anolim.net>
> >     > > > "},"i-941-1237-VM":{"state":"PowerOn","host":" cs2.anolim.net
> >     <http://cs2.anolim.net>
> >     > > >
> >
>  
> "}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}]
> >     > > > }
> >     > >
> >     >
> >     > > > 2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent]
> >     (Agent-Handler-3:null)
> >     > > > Received response: Seq 4-1343: { Ans: , MgmtId: 161344838950,
> >     via: 4,
> >     > > > Ver:
> >     > > > v1, Flags: 100010,
> >     > > >
> >
>  
> [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}]
> >     > > > }
> >     > >
> >     >
> >     > > > 2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent]
> >     > > > (agentRequest-Handler-5:null) Processing command:
> >     > > > com.cloud.agent.api.GetVmStatsCommand
> >     > >
> >     >
> >     > > > 2015-10-29 09:23:02,630 INFO [cloud.agent.AgentShell]
> >     (main:null) Agent
> >     > > > started
> >     > >
> >     >
> >     > > > 2015-10-29 09:23:02,632 INFO [cloud.agent.AgentShell]
> (main:null)
> >     > > > Implementation Version is 4.3.2
> >     > >
> >     >
> >
> >     > > > Can you please give us direction to go deeper... As we dont
> >     udnerstand
> >     > > > what
> >     > > > happened...
> >     > >
> >     >
> >
> >     > > > Issue reproducing very easy:
> >     > >
> >     >
> >
> >     > > > (to speed up delete process we set storage.cleanup.interval=10)
> >     > >
> >     >
> >
> >     > > > 1. Create Volume
> >     > >
> >     >
> >     > > > 2. Attach it to VM
> >     > >
> >     >
> >     > > > 3. Create snapshot from this volume by hand on ceph
> >     > >
> >     >
> >     > > > 4. Do snap_protect by hand on ceph for snap from previous step
> >     > >
> >     >
> >     > > > 5. Create child via snap clone
> >     > >
> >     >
> >     > > > 6. Deattach volume from VM and delete it
> >     > >
> >     >
> >
> >     > > > Our problem - that this happen not in such lab situation, but
> >     in real
> >     > > > life...
> >     > >
> >     >
> >
> >     > _______________________________________________
> >     > ceph-users mailing list
> >     > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> >     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

Reply via email to