On 03-11-15 10:04, Voloshanenko Igor wrote:
> Wido, there is also a minor issue with 0.2.0 rados-java
> 

Did you also re-compile CloudStack against the new rados-java? I still
think this is related to the Agent starting its cleanup while there are
snapshots which still need to be unprotected.

In the meantime you might want to remove any existing RBD snapshots
using the RBD commands from Ceph; that might solve the problem.
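
For reference, a minimal sketch of that cleanup through rados-java (the
same API the Agent uses). The monitor, key, pool and image names below
are placeholders; on the CLI the equivalent would be 'rbd snap
unprotect' followed by 'rbd snap purge':

    import java.util.List;

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    public class RbdSnapCleanup {
        public static void main(String[] args) throws Exception {
            Rados rados = new Rados("admin");
            rados.confSet("mon_host", "cephmon.example.com"); // placeholder monitor
            rados.confSet("key", "AQ...");                    // placeholder cephx key
            rados.connect();

            IoCTX io = rados.ioCtxCreate("cloudstack");       // placeholder pool
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("my-volume");           // placeholder image

            try {
                // Unprotect and remove every snapshot before deleting the image
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        // Throws RbdException if the snapshot still has children
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                // Close the image before destroying the IO context
                rbd.close(image);
                rados.ioCtxDestroy(io);
            }
        }
    }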

Wido

> We still catch:
> 
> -storage/ae1b6e5f-f5f4-4abe-aee3-084f2fe71876
> 2015-11-02 11:41:14,958 WARN  [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught:
> java.lang.NegativeArraySizeException
> at com.ceph.rbd.RbdImage.snapList(Unknown Source)
> at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
> at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
> at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
> at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)
> at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:57)
> at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1385)
> at com.cloud.agent.Agent.processRequest(Agent.java:503)
> at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:808)
> at com.cloud.utils.nio.Task.run(Task.java:84)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 
> Even with the updated lib:
> 
> root@ix1-c7-2:/usr/share/cloudstack-agent/lib# ls /usr/share/cloudstack-agent/lib | grep rados
> rados-0.2.0.jar
> 
> 2015-11-03 11:01 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:
> 
>     Wido, that's the main issue. No records at all...
> 
> 
>     So, from last time:
> 
> 
>     2015-11-02 11:40:33,204 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>     2015-11-02 11:40:33,207 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>     2015-11-02 11:40:35,316 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Processing command: com.cloud.agent.api.GetVmStatsCommand
>     2015-11-02 11:40:35,867 INFO  [cloud.agent.AgentShell] (main:null) Agent started
>     2015-11-02 11:40:35,868 INFO  [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
> 
>     So, almost always there is an exception after RbdUnprotect, and then in
>     approx. 20 minutes - a crash.
>     Almost every time it happens after GetVmStatsCommand or disk
>     stats... Possibly the evil is hidden in the UpdateDiskInfo method... but
>     I can't find any bad code there (((
> 
>     2015-11-03 10:40 GMT+02:00 Wido den Hollander <w...@42on.com>:
> 
> 
> 
>         On 03-11-15 01:54, Voloshanenko Igor wrote:
>         > Thank you, Jason!
>         >
>         > Any advice for troubleshooting?
>         >
>         > I'm looking in the code, and right now I don't see anything bad :(
>         >
> 
        Can you run the CloudStack Agent in DEBUG mode and then see after
        which lines in the logs it crashes?
> 
>         Wido
> 
>         > 2015-11-03 1:32 GMT+02:00 Jason Dillaman <dilla...@redhat.com>:
>         >
>         >     Most likely not going to be related to 13045 since you aren't
>         >     actively exporting an image diff.  The most likely problem is that
>         >     the RADOS IO context is being closed prior to closing the RBD image.
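>         >
>         >     In rados-java terms that means the teardown order matters (a
>         >     minimal sketch, with placeholder names and error handling left
>         >     out):
>         >
>         >         RbdImage image = rbd.open("my-volume");  // placeholder image
>         >         try {
>         >             // ... snapshot / clone work on the image ...
>         >         } finally {
>         >             rbd.close(image);        // 1. close the RBD image first
>         >             rados.ioCtxDestroy(io);  // 2. only then destroy the IO context
>         >         }
>         >
>         >     Tearing down the IO context while the image (and its ObjectCacher
>         >     flusher thread) is still open can surface much later as a native
>         >     crash in librbd.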
>         >
>         >     --
>         >
>         >     Jason Dillaman
>         >
>         >
>         >     ----- Original Message -----
>         >
>         >     > From: "Voloshanenko Igor" <igor.voloshane...@gmail.com>
>         >     > To: "Ceph Users" <ceph-users@lists.ceph.com>
>         >     > Sent: Thursday, October 29, 2015 5:27:17 PM
>         >     > Subject: Re: [ceph-users] Cloudstack agent crashed JVM with
>         >     > exception in librbd
>         >
>         >     > From everything we analyzed, it looks like this issue:
>         >     > http://tracker.ceph.com/issues/13045
>         >
>         >     > PR: https://github.com/ceph/ceph/pull/6097
>         >
>         >     > Can anyone help us confirm this? :)
>         >
>         >     > 2015-10-29 23:13 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:
>         >
>         >     > > Additional trace:
>         >     > >
>         >     > > #0 0x00007f30f9891cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>         >     > > #1 0x00007f30f98950d8 in __GI_abort () at abort.c:89
>         >     > > #2 0x00007f30f87b36b5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>         >     > > #3 0x00007f30f87b1836 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>         >     > > #4 0x00007f30f87b1863 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>         >     > > #5 0x00007f30f87b1aa2 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>         >     > > #6 0x00007f2fddb50778 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f2fdddeca05 "sub < m_subsys.size()", file=file@entry=0x7f2fdddec9f0 "./log/SubsystemMap.h", line=line@entry=62, func=func@entry=0x7f2fdddedba0 <_ZZN4ceph3log12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::log::SubsystemMap::should_gather(unsigned int, int)") at common/assert.cc:77
>         >     > > #7 0x00007f2fdda1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
>         >     > > #8 0x00007f2fdda3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
>         >     > > #9 0x00007f2fddd879be in ObjectCacher::flusher_entry (this=0x7f2ff80b27a0) at osdc/ObjectCacher.cc:1527
>         >     > > #10 0x00007f2fddd9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
>         >     > > #11 0x00007f30f9c28182 in start_thread (arg=0x7f2e1a7fc700) at pthread_create.c:312
>         >     > > #12 0x00007f30f995547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>         >
>         >     > > 2015-10-29 17:38 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:
>         >     >
>         >
>         >     > > > Hi Wido and all the community.
>         >     > >
>         >     > > > We have caught a very weird issue in our CloudStack installation,
>         >     > > > which is related to Ceph and possibly to the rados-java lib.
>         >     > >
>         >     > > > So, the agent crashes constantly (which causes a very big problem
>         >     > > > for us...).
>         >     > >
>         >
>         >     > > > When the agent crashes - it crashes the JVM. And no event in the
>         >     > > > logs at all.
>         >     > >
>         >     > > > We enabled crash dumps, and after a crash we see the following picture:
>         >     > >
>         >     > > > # grep -A1 "Problematic frame" < /hs_err_pid30260.log
>         >     > > > Problematic frame:
>         >     > > > C [librbd.so.1.0.0+0x5d681]
>         >     > >
>         >     > > > # gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
>         >     > >
>         >     >
>         >     > > > (gdb) bt
>         >     > > > ...
>         >     > > > #7 0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
>         >     > > > #8 0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
>         >     > > > #9 0x00007f30b9d879be in ObjectCacher::flusher_entry (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527
>         >     > > > #10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
>         >     > >
>         >
>         >     > > > From the Ceph code, this part is executed when flushing a cached
>         >     > > > object... And we don't understand why, because we have an
>         >     > > > absolutely different race condition to reproduce it.
>         >     > >
>         >
>         >     > > > As CloudStack does not have a good implementation of the snapshot
>         >     > > > lifecycle yet, it sometimes happens that some volumes are already
>         >     > > > marked as EXPUNGED in the DB, and then CloudStack tries to delete
>         >     > > > the base volume before trying to unprotect its snapshot.
>         >     > >
>         >     > > > Sure, unprotecting fails, and a normal exception is returned back
>         >     > > > (it fails because the snap has children...)
>         >     > >
>         >     >
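>         >     > > > For clarity, the order that works (a minimal rados-java sketch
>         >     > > > with names taken from the log below; it mirrors what the Agent's
>         >     > > > deletePhysicalDisk has to do):
>         >     > >
>         >     > > >     RbdImage image = rbd.open("71b1e2e9-1985-45ca-9ab6-9e5016b86b7c");
>         >     > > >     // A protected snapshot can only be unprotected after all of
>         >     > > >     // its children (clones) are flattened or deleted.
>         >     > > >     if (image.snapIsProtected("cloudstack-base-snap")) {
>         >     > > >         image.snapUnprotect("cloudstack-base-snap"); // RbdException if a child exists
>         >     > > >     }
>         >     > > >     image.snapRemove("cloudstack-base-snap");
>         >     > > >     rbd.close(image);
>         >     > > >     rbd.remove("71b1e2e9-1985-45ca-9ab6-9e5016b86b7c"); // only now delete the volume
>         >     > >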
>         >
>         >     > > > 2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
>         >     > > > 2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Execution is successful.
>         >     > > > 2015-10-29 09:02:19,554 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image
>         >     > > > 2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at cephmon.anolim.net:6789
>         >     > > > 2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting snapshot cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
>         >     > > > 2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Failed to delete volume: com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap
>         >     > > > 2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Seq 4-1921583831: { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap","wait":0}}] }
>         >     > > > 2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing command: com.cloud.agent.api.GetHostStatsCommand
>         >     > > > 2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n 1| awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle
>         >     > > > 2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>         >     > > > 2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c freeMem=$(free|grep cache:|awk '{print $4}');echo $freeMem
>         >     > > > 2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>         >     > >
>         >     >
>         >
>         >     > > > BUT, after 20 minutes the agent crashed... If we remove all the
>         >     > > > children and create the conditions for CloudStack to delete the
>         >     > > > volume - everything is OK - no agent crash within 20 minutes...
>         >     > >
>         >     > > > We can't connect this action - volume delete - with the agent
>         >     > > > crash... Also we don't understand why it takes +- 20 minutes, and
>         >     > > > only then the agent crashes...
>         >     > >
>         >     >
>         >
>         >     > > > From the logs, before the crash - only GetVMStats... And then -
>         >     > > > the agent is started again...
>         >     > >
>         >     >
>         >
>         >     > > > 2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 4-1343: { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}] }
>         >     > > > 2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) Received response: Seq 4-1343: { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}] }
>         >     > > > 2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Processing command: com.cloud.agent.api.GetVmStatsCommand
>         >     > > > 2015-10-29 09:23:02,630 INFO [cloud.agent.AgentShell] (main:null) Agent started
>         >     > > > 2015-10-29 09:23:02,632 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.3.2
>         >     > >
>         >     >
>         >
>         >     > > > Can you please give us a direction to dig deeper? We don't
>         >     > > > understand what happened...
>         >     > >
>         >     > > > The issue is very easy to reproduce (a sketch follows the steps below):
>         >     > >
>         >     > > > (to speed up the delete process we set storage.cleanup.interval=10)
>         >     > >
>         >
>         >     > > > 1. Create a volume
>         >     > > > 2. Attach it to a VM
>         >     > > > 3. Create a snapshot of this volume by hand on Ceph
>         >     > > > 4. Protect the snapshot from step 3 by hand on Ceph
>         >     > > > 5. Create a child via snap clone
>         >     > > > 6. Detach the volume from the VM and delete it
>         >     > >
>         >
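>         >     > > > A sketch of those steps with rados-java (CLI equivalents:
>         >     > > > rbd snap create / protect / clone; the clone in step 5 is
>         >     > > > assumed to be created out of band here, since that is the part
>         >     > > > that leaves a child behind):
>         >     > >
>         >     > > >     RbdImage image = rbd.open("my-volume");    // placeholder volume
>         >     > > >     image.snapCreate("cloudstack-base-snap");  // step 3
>         >     > > >     image.snapProtect("cloudstack-base-snap"); // step 4
>         >     > > >     // step 5: create a child, e.g. 'rbd clone', done out of band
>         >     > > >     // step 6: the delete path now hits this:
>         >     > > >     image.snapUnprotect("cloudstack-base-snap");
>         >     > > >     // -> com.ceph.rbd.RbdException: Failed to unprotect snapshot
>         >     > >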
>         >     > > > Our problem is that this happens not only in such a lab
>         >     > > > situation, but in real life...
>         >     > >
>         >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
