Wido, also a minor issue with 0.2.0 java-rados. We still catch:
-storage/ae1b6e5f-f5f4-4abe-aee3-084f2fe71876
2015-11-02 11:41:14,958 WARN [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
        at com.ceph.rbd.RbdImage.snapList(Unknown Source)
        at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
        at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
        at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
        at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)
        at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:57)
        at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1385)
        at com.cloud.agent.Agent.processRequest(Agent.java:503)
        at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:808)
        at com.cloud.utils.nio.Task.run(Task.java:84)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Even with the updated lib:

root@ix1-c7-2:/usr/share/cloudstack-agent/lib# ls /usr/share/cloudstack-agent/lib | grep rados
rados-0.2.0.jar

2015-11-03 11:01 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:

> Wido, it's the main issue. No records at all...
>
> So, from last time:
>
> 2015-11-02 11:40:33,204 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
> 2015-11-02 11:40:33,207 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
> 2015-11-02 11:40:35,316 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Processing command: com.cloud.agent.api.GetVmStatsCommand
> 2015-11-02 11:40:35,867 INFO [cloud.agent.AgentShell] (main:null) Agent started
> 2015-11-02 11:40:35,868 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>
> So, almost always it's an exception after RbdUnprotect, and then in approx. 20 minutes a crash.
> Almost all the time it happens after GetVmStatsCommand or disk stats... Possibly the evil is hidden in the UpdateDiskInfo method... but I can't find any bad code there (((
>
> 2015-11-03 10:40 GMT+02:00 Wido den Hollander <w...@42on.com>:
>
>> On 03-11-15 01:54, Voloshanenko Igor wrote:
>> > Thank you, Jason!
>> >
>> > Any advice for troubleshooting?
>> >
>> > I'm looking in the code, and right now I don't see any bad things :(
>> >
>>
>> Can you run the CloudStack Agent in DEBUG mode and then see after which
>> lines in the logs it crashes?
>>
>> Wido
>>
>> > 2015-11-03 1:32 GMT+02:00 Jason Dillaman <dilla...@redhat.com>:
>> >
>> > Most likely not going to be related to 13045 since you aren't
>> > actively exporting an image diff. The most likely problem is that
>> > the RADOS IO context is being closed prior to closing the RBD image.
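To illustrate the close ordering Jason describes, here is a minimal rados-java sketch (not the actual CloudStack agent code; the monitor address, key, pool and image names are placeholders): the RbdImage has to be closed while its IoCTX is still alive, and only afterwards should the IO context and the cluster handle be torn down.

import java.util.List;

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class RbdCloseOrder {
    public static void main(String[] args) throws Exception {
        Rados rados = new Rados("admin");
        rados.confSet("mon_host", "cephmon.anolim.net");  // placeholder monitor address
        rados.confSet("key", "<cephx key>");              // placeholder key
        rados.connect();

        IoCTX io = rados.ioCtxCreate("cloudstack");       // placeholder pool name
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("71b1e2e9-1985-45ca-9ab6-9e5016b86b7c");
            try {
                List<RbdSnapInfo> snaps = image.snapList();
                System.out.println("snapshots: " + snaps.size());
            } finally {
                rbd.close(image);      // 1. close the image while its IoCTX is still valid
            }
        } finally {
            rados.ioCtxDestroy(io);    // 2. only then destroy the IO context
            rados.shutDown();          // 3. and finally shut down the cluster handle
        }
    }
}

If an error path destroys the IoCTX, or simply leaks an open RbdImage, the image's ObjectCacher flusher thread keeps running against freed state, which would be consistent with the SubsystemMap::should_gather assert in the backtraces further down the thread.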
>> >
>> > --
>> >
>> > Jason Dillaman
>> >
>> >
>> > ----- Original Message -----
>> >
>> > > From: "Voloshanenko Igor" <igor.voloshane...@gmail.com>
>> > > To: "Ceph Users" <ceph-users@lists.ceph.com>
>> > > Sent: Thursday, October 29, 2015 5:27:17 PM
>> > > Subject: Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd
>> >
>> > > From all we analyzed, it looks like it's this issue:
>> > > http://tracker.ceph.com/issues/13045
>> >
>> > > PR: https://github.com/ceph/ceph/pull/6097
>> >
>> > > Can anyone help us to confirm this? :)
>> >
>> > > 2015-10-29 23:13 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:
>> >
>> > > > Additional trace:
>> > > >
>> > > > #0  0x00007f30f9891cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> > > > #1  0x00007f30f98950d8 in __GI_abort () at abort.c:89
>> > > > #2  0x00007f30f87b36b5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> > > > #3  0x00007f30f87b1836 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> > > > #4  0x00007f30f87b1863 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> > > > #5  0x00007f30f87b1aa2 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> > > > #6  0x00007f2fddb50778 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f2fdddeca05 "sub < m_subsys.size()",
>> > > >     file=file@entry=0x7f2fdddec9f0 "./log/SubsystemMap.h", line=line@entry=62,
>> > > >     func=func@entry=0x7f2fdddedba0 <_ZZN4ceph3log12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::log::SubsystemMap::should_gather(unsigned int, int)") at common/assert.cc:77
>> > > > #7  0x00007f2fdda1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
>> > > > #8  0x00007f2fdda3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
>> > > > #9  0x00007f2fddd879be in ObjectCacher::flusher_entry (this=0x7f2ff80b27a0) at osdc/ObjectCacher.cc:1527
>> > > > #10 0x00007f2fddd9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
>> > > > #11 0x00007f30f9c28182 in start_thread (arg=0x7f2e1a7fc700) at pthread_create.c:312
>> > > > #12 0x00007f30f995547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>> >
>> > > > 2015-10-29 17:38 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:
>> >
>> > > > > Hi Wido and all community.
>> > > > >
>> > > > > We caught a very idiotic issue on our Cloudstack installation, which is related to ceph and possibly to the java-rados lib.
>> > > > >
>> > > > > So, we constantly have the agent crashing (which causes a very big problem for us...).
>> > > > >
>> > > > > When the agent crashes, it crashes the JVM. And no event in the logs at all.
>> > > > >
>> > > > > We enabled crash dump, and after a crash we see the following:
>> > > > >
>> > > > > #grep -A1 "Problematic frame" < /hs_err_pid30260.log
>> > > > > Problematic frame:
>> > > > > C  [librbd.so.1.0.0+0x5d681]
>> > > > >
>> > > > > # gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
>> > > > > (gdb) bt
>> > > > > ...
>> > > > > #7  0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
>> > > > > #8  0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
>> > > > > #9  0x00007f30b9d879be in ObjectCacher::flusher_entry (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527
>> > > > > #10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
>> > > > >
>> > > > > From the ceph code, this part is executed when flushing a cache object... And we don't understand why, because we have an absolutely different race condition to reproduce it.
>> > > > >
>> > > > > As cloudstack does not yet have a good implementation of the snapshot lifecycle, it sometimes happens that some volumes are already marked as EXPUNGED in the DB, and then cloudstack tries to delete the base volume before it tries to unprotect it.
>> > > > >
>> > > > > Sure, unprotecting fails and a normal exception is returned back (it fails because the snap has children...)
>> > > > >
>> > > > > 2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
>> > > > > 2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Execution is successful.
>> > > > >
>> > > > > 2015-10-29 09:02:19,554 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image
>> > > > > 2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at cephmon.anolim.net:6789
>> > > > > 2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting snapshot cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
>> > > > > 2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Failed to delete volume: com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap
>> > > > > 2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Seq 4-1921583831: { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap","wait":0}}] }
>> > > > > 2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing command: com.cloud.agent.api.GetHostStatsCommand
>> > > > > 2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n 1| awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle
>> > > > > 2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>> > > > > 2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c freeMem=$(free|grep cache:|awk '{print $4}');echo $freeMem
>> > > > > 2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>> > > > >
>> > > > > BUT, after 20 minutes the agent crashed... If we remove all children and create the conditions for cloudstack to delete the volume, all is OK: no agent crash within 20 minutes...
>> > > > >
>> > > > > We can't connect this action (volume delete) with the agent crash... Also, we don't understand why roughly 20 minutes have to pass and only then the agent crashes...
>> > > > >
>> > > > > From the logs, before the crash there is only GetVMStats... And then the agent started...
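The logs above would fit the pattern Jason suggests: the unprotect error path gives up on the disk while the RbdImage is still open, and only the IO context is torn down (or the open image is leaked). A purely hypothetical sketch of that pattern follows; this is NOT the real LibvirtStorageAdaptor code, and the method, pool and snapshot names are made up for illustration.

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdException;
import com.ceph.rbd.RbdImage;

// Hypothetical illustration of the suspected bug pattern, not actual CloudStack code.
public class LeakyUnprotect {
    public static void deleteDisk(Rados rados, String pool, String name) throws Exception {
        IoCTX io = rados.ioCtxCreate(pool);
        Rbd rbd = new Rbd(io);
        RbdImage image = rbd.open(name);
        try {
            image.snapUnprotect("cloudstack-base-snap"); // fails here: the snapshot has children
            rbd.remove(name);
        } catch (RbdException e) {
            // Error path: `image` is never closed. The IoCTX is destroyed while the
            // image is still open, so the image's ObjectCacher flusher thread keeps
            // running and can later trip the SubsystemMap assert seen in the backtrace,
            // possibly long after the failed delete.
            rados.ioCtxDestroy(io);
            throw e;
        }
        rbd.close(image);
        rados.ioCtxDestroy(io);
    }
}

A safer variant would close the image in a finally block before destroying the IoCTX, exactly as in the earlier sketch.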
>> > > > >
>> > > > > 2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 4-1343: { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}] }
>> > > > > 2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) Received response: Seq 4-1343: { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}] }
>> > > > > 2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Processing command: com.cloud.agent.api.GetVmStatsCommand
>> > > > > 2015-10-29 09:23:02,630 INFO [cloud.agent.AgentShell] (main:null) Agent started
>> > > > > 2015-10-29 09:23:02,632 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.3.2
>> > > > >
>> > > > > Can you please give us a direction to dig deeper... As we don't understand what happened...
>> > > > >
>> > > > > The issue is reproduced very easily:
>> > > > >
>> > > > > (to speed up the delete process we set storage.cleanup.interval=10)
>> > > > >
>> > > > > 1. Create a volume
>> > > > > 2. Attach it to a VM
>> > > > > 3. Create a snapshot from this volume by hand on ceph
>> > > > > 4. Do snap_protect by hand on ceph for the snap from the previous step
>> > > > > 5. Create a child via snap clone
>> > > > > 6. Detach the volume from the VM and delete it
>> > > > >
>> > > > > Our problem is that this happens not in such a lab situation, but in real life...
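For completeness, steps 3-5 of the reproduction above can also be driven through rados-java instead of the ceph CLI. A minimal sketch, assuming a pool called cloudstack and example image/snapshot names; the exact Rbd.clone() signature and the feature/order values are an assumption about rados-java 0.2.x and may differ between versions.

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;

// Sketch of the manual reproduction steps (snapshot, protect, clone) from the list above.
public class ReproSetup {
    public static void main(String[] args) throws Exception {
        Rados rados = new Rados("admin");
        rados.confReadFile(new java.io.File("/etc/ceph/ceph.conf"));
        rados.connect();
        IoCTX io = rados.ioCtxCreate("cloudstack");   // example pool name
        Rbd rbd = new Rbd(io);

        RbdImage image = rbd.open("test-volume");     // example volume name
        try {
            image.snapCreate("snap1");                // step 3: snapshot by hand
            image.snapProtect("snap1");               // step 4: protect it
        } finally {
            rbd.close(image);
        }

        // Step 5: clone a child from the protected snapshot. This signature and the
        // features/order arguments (layering only, default order) are assumptions.
        rbd.clone("test-volume", "snap1", io, "test-volume-child", 1, 0);

        rados.ioCtxDestroy(io);
        rados.shutDown();
    }
}

With the clone in place, the agent's later attempt to unprotect and remove the volume's snapshots fails, as in the "Failed to unprotect snapshot" log above, because the snapshot still has a child.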
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com