[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @yvsubhash Please, take this up. So far this PR hasn't moved forward. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user yvsubhash commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 Is the refactoring suggested by Rafael being taken care of by @nvazquez? If not, I will take it up.
Github user cloudmonger commented on the issue: https://github.com/apache/cloudstack/pull/1762

### ACS CI BVT Run

**Summary:**
Build Number 464
Hypervisor xenserver
NetworkType Advanced
Passed=104 Failed=1 Skipped=7

_Link to logs Folder (search by build_no):_ https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

**Failed tests:**
* test_routers_network_ops.py
 * test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failed

**Skipped tests:**
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

**Passed test suites:**
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py
Github user rafaelweingartner commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 I have the same understanding about the agent LB, and this is one of the problems I think we have found here. It seems that this method is undoing the balance created by the agent LB. And, of course, this method is also causing deadlocks. Let's hear the feedback from others and discuss how to move forward.
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rafaelweingartner Thanks a lot. I totally agree that resetting hosts doesn't really need to be part of the transaction and should be extracted to a new method. The same goes for lines 527-546, and for another block after line 551. My understanding of agent LB is that it is handled separately from the reconnect part. I might be wrong, but it is done in ClusteredAgentManagerImpl by scheduling a rebalancing task every 60 sec (getAgentRebalanceScanTask), which takes care of transferring connected agents. @rhtyd @jburwell @koushik-das @karuturi Do you agree that we can split the transaction in findAndUpdateDirectAgentToLoad into 3 non-transactional methods and thus eliminate one side of a repeated deadlock? This is the very core of agent management, for which it is very hard, if at all possible, to write a smoke test. If so, @nvazquez might be able to work on refactoring this method later this month.
Github user rafaelweingartner commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38, it is great that you found one of the methods that cause the deadlock problem: "com.cloud.host.dao.HostDaoImpl.findAndUpdateDirectAgentToLoad(long, Long, long)". This method surely is problematic. I would start by asking: (i) does it need to manually open a transaction (at line 512)? Isn't that the goal of the "@DB" annotation? (ii) What is the objective of the method ("findAndUpdateDirectAgentToLoad")? It looks too complicated, with too many accesses to the DB.

The method "resetHosts" at line 517 looks for hosts that are "managed" by the current MS and are "Disconnected", to mark them as unmanaged by any MS. That is, it sets "managementServerId = null" on hosts marked as "Disconnect". Would it not be better to have a specific method/transaction only for that process? If we extract that chunk of code to an isolated method, could we not have atomic access to the DB without locking? "update set managementServerId = null from hosts where ..."; if the method is isolated, I do not see reasons for locks here.

A little further on, there is another block which could be isolated, lines 527-546. This block of code looks for clusters being managed by the current MS. Then, it searches for hosts of those clusters which are not being managed by the current MS (or not managed at all?). I did not understand that, because I have seen in some other piece of code that we have a balancing approach; that is, we try to balance the number of hosts managed by each MS. This piece of code seems to remove the balancing process. Then, from line 551 onward (if the number of hosts is less than the limit), it tries to look for hosts of clusters not being managed by any MS. This block could also be an isolated one. And again, we might be able to do this process without using locks.

My final comment: even if we choose not to refactor and improve this piece of code, there is one thing that seems very strange to me. The method "findAndUpdateDirectAgentToLoad" is annotated with "@DB", and it also opens and tries to manage a transaction manually. Then all of the pieces of code I mentioned call other methods that are also annotated with "@DB". Can this cause a problem? For instance, when I use Spring and methods from a service layer (the place where I configure my transaction pattern) call one another, they all use/share the same transaction opened when the first method of the service layer was called, unless specified otherwise. How does it work here in ACS?
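The first suggestion in this comment (moving the "resetHosts" step into its own short, isolated statement) could be sketched roughly as follows. This is only an illustration in plain JDBC, not CloudStack's DAO layer: the class and method names (`ResetHostsSketch`, `releaseDisconnectedHosts`) are hypothetical, the table and column names are copied from the `SELECT ... FOR UPDATE` that appears in the management server log quoted in this thread, and collapsing the select-then-update into one statement is an assumption about what the refactoring might look like. Note that a single `UPDATE` still takes row locks in InnoDB; the point is only that they would be held for one statement rather than across a multi-step transaction.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ResetHostsSketch {

    // WHERE clause copied from the logged SELECT ... FOR UPDATE;
    // turning it into a single UPDATE is the assumption being illustrated.
    static final String RELEASE_HOSTS_SQL =
            "UPDATE host SET mgmt_server_id = NULL"
            + " WHERE resource IS NOT NULL"
            + " AND mgmt_server_id = ?"
            + " AND last_ping <= ?"
            + " AND cluster_id IS NOT NULL"
            + " AND status IN ('Disconnected','Down','Alert')"
            + " AND removed IS NULL";

    /**
     * Releases hosts owned by the given management server in one statement.
     * Assumes the connection is in auto-commit mode, so the statement runs
     * as its own short transaction. Returns the number of rows updated.
     */
    static int releaseDisconnectedHosts(Connection conn, long msId, long lastPingCutoff)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(RELEASE_HOSTS_SQL)) {
            ps.setLong(1, msId);
            ps.setLong(2, lastPingCutoff);
            return ps.executeUpdate();
        }
    }
}
```

Whether the other two blocks (lines 527-546 and line 551 onward) can be reduced in the same way depends on logic that is not shown in this thread.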
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rafaelweingartner I might be wrong, but the 2nd one came from findAndUpdateDirectAgentToLoad in HostDaoImpl, which also creates a large transaction.
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rafaelweingartner You might be right that pod_vlan_map should be in the join. Maybe I didn't find the correct methods after all. @jburwell @rhtyd What do you think? I was able to find the management server log for Deadlock 1. It looks like one of the transactions came from the findAndUpdateDirectAgentToLoad method in HostDaoImpl, which creates a rather complex transaction:

2016-11-24 15:04:39,284 DEBUG [host.dao.HostDaoImpl] (ClusteredAgentManager Timer:ctx-a8e9449c) Resetting hosts suitable for reconnect
2016-11-24 15:04:39,320 DEBUG [db.Transaction.Transaction] (ClusteredAgentManager Timer:ctx-a8e9449c) Rolling back the transaction: Time = 36 Name = ClusteredAgentManager Timer; called by -TransactionLegacy.rollback:879-TransactionLegacy.removeUpTo:822-TransactionLegacy.close:646-TransactionContextInterceptor.invoke:36-ReflectiveMethodInvocation.proceed:161-ExposeInvocationInterceptor.invoke:91-ReflectiveMethodInvocation.proceed:172-JdkDynamicAopProxy.invoke:204-$Proxy48.findAndUpdateDirectAgentToLoad:-1-ClusteredAgentManagerImpl.scanDirectAgentToLoad:195-ClusteredAgentManagerImpl.runDirectAgentScanTimerTask:185-ClusteredAgentManagerImpl.access$100:99
2016-11-24 15:04:39,322 ERROR [agent.manager.ClusteredAgentManagerImpl] (ClusteredAgentManager Timer:ctx-a8e9449c) Unexpected exception DB Exception on: com.mysql.jdbc.JDBC4PreparedStatement@1e58727c: SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host WHERE host.resource IS NOT NULL AND host.mgmt_server_id = 345048964870 AND host.last_ping <= 1445339907 AND host.cluster_id IS NOT NULL AND host.status IN ('Disconnected','Down','Alert') AND host.removed IS NULL FOR UPDATE
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction

Beginning of second transaction was

SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host LEFT OUTER JOIN op_host_transfer ON host.id=op_host_transfer.id IN

I will try to trace it to the ACS method.
Github user rafaelweingartner commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 if that "AssignIpAddressFromPodVlanSearch" object was being used to generate the SQL, should we not see a join with "pod_vlan_map" too? For me, this SC is very confusing. Following the same idea of what I would do if using Spring to manage transactions, the method "fetchNewPublicIp" does not need the "@DB" annotation (assuming this is the annotation that opens a transaction and locks tables in ACS). The method "fetchNewPublicIp" is a simple "retrieve/get" method. Whenever we have to lock the table that is being used by this method, we could call "fetchNewPublicIp" from a method that has the "@DB" annotation (assuming it has transaction propagation). This is something that already seems to happen. Methods "allocateIp" and "assignDedicateIpAddress" use "fetchNewPublicIp" and they have their own "@DB" annotation. Methods "assignPublicIpAddressFromVlans" and "assignPublicIpAddress" do not seem to do anything that requires a transaction; despite their misleading names (at least for me), which suggest that something will be assigned to someone, they just call "fetchNewPublicIp" and return its response. Therefore, I do not think they require a locking transaction.
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rafaelweingartner I tried tracing where deadlock 5 originated. It seems both transactions are part of the same method, fetchNewPublicIp in IpAddressManagerImpl. The transactions are executed on different management servers. The update is triggered through the markPublicIpAsAllocated method. The select seems to come from there as well (fetchNewPublicIp in IpAddressManagerImpl):

    AssignIpAddressFromPodVlanSearch = _ipAddressDao.createSearchBuilder();
    AssignIpAddressFromPodVlanSearch.and("dc", AssignIpAddressFromPodVlanSearch.entity().getDataCenterId(), Op.EQ);
    AssignIpAddressFromPodVlanSearch.and("allocated", AssignIpAddressFromPodVlanSearch.entity().getAllocatedTime(), Op.NULL);
    SearchBuilder podVlanSearch = _vlanDao.createSearchBuilder();
    podVlanSearch.and("type", podVlanSearch.entity().getVlanType(), Op.EQ);
    podVlanSearch.and("networkId", podVlanSearch.entity().getNetworkId(), Op.EQ);
    SearchBuilder podVlanMapSB = _podVlanMapDao.createSearchBuilder();
    podVlanMapSB.and("podId", podVlanMapSB.entity().getPodId(), Op.EQ);
    AssignIpAddressFromPodVlanSearch.join("podVlanMapSB", podVlanMapSB, podVlanMapSB.entity().getVlanDbId(), AssignIpAddressFromPodVlanSearch.entity().getVlanId(), JoinType.INNER);
    AssignIpAddressFromPodVlanSearch.join("vlan", podVlanSearch, podVlanSearch.entity().getId(), AssignIpAddressFromPodVlanSearch.entity().getVlanId(), JoinType.INNER);
    AssignIpAddressFromPodVlanSearch.done();

    public IPAddressVO doInTransaction(TransactionStatus status) throws InsufficientAddressCapacityException {
        StringBuilder errorMessage = new StringBuilder("Unable to get ip adress in ");
        boolean fetchFromDedicatedRange = false;
        List dedicatedVlanDbIds = new ArrayList();
        List nonDedicatedVlanDbIds = new ArrayList();
        SearchCriteria sc = null;
        if (podId != null) {
            sc = **AssignIpAddressFromPodVlanSearch**.create();
            sc.setJoinParameters("podVlanMapSB", "podId", podId);
            errorMessage.append(" pod id=" + podId);
        } else {
            sc = AssignIpAddressSearch.create();
            errorMessage.append(" zone id=" + dcId);
        }
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rafaelweingartner It looks like deadlocks 2 and 3 are the same. I scanned our production log, and since last December we have had 6400 deadlocks. Of them, close to 6000 were Deadlock 1, 20 were Deadlock 2, and 700 were a different one, Deadlock 5. The other deadlocks occurred in negligible numbers. I think that if we figure out Deadlock 1 and Deadlock 5, this will be a good start. I will try to find the source of the transactions for them. In production we run a commercial distribution based for the most part on the 4.7 branch of ACS.

Deadlock 5
*** (1) TRANSACTION:
TRANSACTION D518886F8, ACTIVE 2 sec fetching rows
mysql tables in use 4, locked 4
LOCK WAIT 24 lock struct(s), heap size 3112, 8 row lock(s), undo log entries 17
MySQL thread id 29781, OS thread handle 0x7f9df36db700, query id 3625404021 ussclpdcsmgt012.autodesk.com 10.41.13.14 cloud Sorting result
SELECT user_ip_address.id, user_ip_address.account_id, user_ip_address.domain_id, user_ip_address.public_ip_address, user_ip_address.data_center_id, user_ip_address.source_nat, user_ip_address.allocated, user_ip_address.vlan_db_id, user_ip_address.one_to_one_nat, user_ip_address.vm_id, user_ip_address.state, user_ip_address.mac_address, user_ip_address.source_network_id, user_ip_address.network_id, user_ip_address.uuid, user_ip_address.physical_network_id, user_ip_address.is_system, user_ip_address.vpc_id, user_ip_address.dnat_vmip, user_ip_address.is_portable, user_ip_address.display, user_ip_address.removed, user_ip_address.created FROM user_ip_address INNER JOIN vlan ON user_ip_address.vlan_db_id=vlan.id WHERE user_ip_address.data_center_id = 6 AND user_ip_address.allocated IS NULL AND user_ip_address.vlan_db_id IN (32,33,36,37,41,61,62,91,92,93,94,106,107,108,109,11
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
*** (2) TRANSACTION:
TRANSACTION D5188582B, ACTIVE 17 sec updating or deleting, thread declared inside InnoDB 499
mysql tables in use 1, locked 1
25 lock struct(s), heap size 3112, 13 row lock(s), undo log entries 18
MySQL thread id 29820, OS thread handle 0x7fa35a868700, query id 3625417999 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Updating
UPDATE user_ip_address SET user_ip_address.source_nat=0, user_ip_address.is_system=0, user_ip_address.account_id=3309, user_ip_address.allocated='2016-03-25 15:36:39', user_ip_address.state='Allocated', user_ip_address.domain_id=335 WHERE user_ip_address.id = 3284
*** (2) HOLDS THE LOCK(S):
Github user rafaelweingartner commented on the issue: https://github.com/apache/cloudstack/pull/1762 Thanks, @serg38. Looking at the SQLs you posted, we could start to discuss whether or not some SQL statements need locking transactions. Ignoring Deadlocks 3 and 4 for now, I think we could start with the ones that look the simplest (Deadlocks 1 and 2). These SQLs have probably been generated, so tracking them in ACS may not be that easy, but at first glance I feel that we could execute them without needing a lock in the database. I tried to find the first SQL, without success. Would you mind helping me pinpoint where in the code the SQL from transaction 1 of deadlock 1 is generated? Then we can evaluate whether or not a lock is needed there. Are the SQLs you showed complete? I found a place that could generate SQLs similar to the one in transaction 1 of deadlock 1, but this code adds one extra where clause. The method I am talking about is: com.cloud.cluster.agentlb.ClusterBasedAgentLoadBalancerPlanner.getHostsToRebalance(long, int)
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 Here are a few samples of the deadlocks we observe in a high-transaction-volume environment with multiple management servers. As you can see, most of them are concurrent operations from different management servers, and are either select or select for update statements. The following 4 types account for the majority of the deadlocks we have seen so far (80-90% of all deadlocks). Deadlocks 1-3 happen much more often than deadlock 4. They are next to impossible to reproduce, since they occur once in a few days with 4 management servers and an average VM deployment volume of 3000 a day.

Deadlock type 1:
InnoDB: transactions deadlock detected, dumping detailed information.
151217 3:08:20
*** (1) TRANSACTION:
TRANSACTION BB4D4C91D, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 11 lock struct(s), heap size 3112, 5 row lock(s)
MySQL thread id 47654, OS thread handle 0x7f0475bdd700, query id 3821358107 ussclpdcsmgt012.autodesk.com 10.41.13.14 cloud Sending data
SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host WHERE host.resource IS NOT NULL AND host.mgmt_server_id = 345048964870
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
*** (2) TRANSACTION:
TRANSACTION BB4D4C915, ACTIVE 1 sec fetching rows, thread declared inside InnoDB 449
mysql tables in use 3, locked 3
29 lock struct(s), heap size 6960, 15 row lock(s), undo log entries 1
MySQL thread id 47623, OS thread handle 0x7f0a47074700, query id 3821724056 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Copying to tmp table
SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host LEFT OUTER JOIN op_host_transfer ON host.id=op_host_transfer.id
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 147488 n bits 840 index `i_host__removed` of table `cloud`.`host` trx id BB4D4C915 lock_mode X locks rec but not gap

Deadlock 2:
InnoDB: transactions deadlock detected, dumping detailed information.
151218 11:03:00
*** (1) TRANSACTION:
TRANSACTION BBB232C81, ACTIVE 51 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1248, 2 row lock(s)
MySQL thread id 57308, OS thread handle 0x7f0a45c24700, query id 5217973695 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Sending data
SELECT resource_count.id, resource_count.type, resource_count.account_id, resource_count.domain_id, resource_count.count FROM resource_count WHERE resource_count.id IN (5083,4867,5079,33652,5077) FOR UPDATE
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
*** (2) TRANSACTION:
TRANSACTION BBB2254AC, ACTIVE 116 sec starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
207 lock struct(s), heap size 31160, 1650 row lock(s), undo log entries 2
MySQL thread id 56926, OS thread handle 0x7f04756c9700, query id 5218549710 ussclpdcsmgt014.autodesk.com 10.41.13.16 cloud Sending data
SELECT resource_count.id, resource_count.type, resource_count.account_id, resource_count.domain_id, resource_count.count FROM resource_count WHERE resource_count.id IN (5083,4867,5079,33652,5077) FOR UPDATE

Deadlock 3:
** (1) TRANSACTION:
TRANSACTION BBB232C81, ACTIVE 51 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock
Github user rafaelweingartner commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 I have just now started reading this PR (excuse me if I overlooked some information).

> If we are to try to implement a general way of dealing with deadlocks in ACS how could it be done to ensure DB consistency and correct transaction retry?

Answering your question: in my opinion, we should not "try" to implement a general way of managing transactions. We are only having this type of problem because, instead of using a framework to manage access and transactions in databases, a module was developed to do that and incorporated into ACS; this means we have to maintain and live with this code. Now, the problem is that it would be a Dantesque task to change the way ACS manages transactions today. I am with John on this one: retrying is not a good idea; it can hide problems, cause overhead and cause even more headaches. I think that the best approach is to deal with this type of problem on the fly; this means, as John said, addressing them as bugs when they are reported.

Having said that, I have not helped a bit to solve the problem... Let's see if I can be of any help. I was reading the ticket #CLOUDSTACK-9595. It seems that the problem (reported there) happened when a VM was being removed from the table "instance_group_vm_map". I just do not understand it, because the method called is "UserVmManagerImpl.addInstanceToGroup". I am hoping that this makes sense. Anyway... The MySQL docs have the following on deadlocks:

> A deadlock is a situation where different transactions are unable to proceed because each holds a lock that the other needs

This means there was something else being executed when that VM was deleted/added, and this caused the deadlock and the exception. Probably something else is using the table "instance_group_vm_map". I think we should track down these two tasks/processes that can cause the problem and work them out, instead of looking for a generic way to deal with this situation. Maybe the processes that are causing the deadlock are locking tables that do not need to be locked, or executing some processing that could be avoided or modified. Do we have a use case that can reproduce the problem?
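The MySQL definition quoted in this comment can be reproduced in miniature with two in-process locks. The sketch below is only an analogy, not database code: `ReentrantLock` objects stand in for row locks, the two threads stand in for the two (so far unidentified) processes, and a timed `tryLock` stands in for InnoDB's deadlock detection. All names (`DeadlockShapeSketch`, `countBlockedWorkers`) are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockShapeSketch {

    /**
     * Runs two workers that take the same two locks in opposite order and
     * returns how many of them could not obtain their second lock.
     */
    static int countBlockedWorkers(long waitMillis) {
        ReentrantLock lockA = new ReentrantLock();
        ReentrantLock lockB = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTriedSecond = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        Thread t1 = new Thread(() ->
                work(lockA, lockB, bothHoldFirst, bothTriedSecond, blocked, waitMillis));
        Thread t2 = new Thread(() ->
                work(lockB, lockA, bothHoldFirst, bothTriedSecond, blocked, waitMillis));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return blocked.get();
    }

    private static void work(ReentrantLock first, ReentrantLock second,
                             CountDownLatch bothHoldFirst, CountDownLatch bothTriedSecond,
                             AtomicInteger blocked, long waitMillis) {
        first.lock();
        try {
            // Wait until each worker holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // Each worker now needs the lock the other one is holding.
            if (second.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep the first lock until both workers have made their attempt.
            bothTriedSecond.countDown();
            bothTriedSecond.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }
}
```

Calling `DeadlockShapeSketch.countBlockedWorkers(50)` reports both workers as blocked, which is the shape to look for when tracking down the second process that touches "instance_group_vm_map".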
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rafaelweingartner @swill @wido @koushik-das @karuturi @rhtyd @jburwell Let's ask a different question. If we are to try to implement a general way of dealing with deadlocks in ACS, how could it be done to ensure DB consistency and correct transaction retry?
Github user abhinandanprateek commented on the issue: https://github.com/apache/cloudstack/pull/1762 Even retrying the full transaction could be problematic, as checks done before firing the transaction may no longer be valid. It may work most of the time, but it is not foolproof.
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 What if the author can figure out a way to identify all parts of the transaction being cancelled and retry all of them? Or retry the whole transaction? It would be nice to open a path for the author to implement this.
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 corruption could happen at any point -- it's a ticking time bomb. From an ACID perspective, this patch fails on consistency. All data being updated must be re-queried and validated in order to ensure the consistency guarantee is not violated. In a high volume system, it's not a matter of if, but when, a sequence of events will occur and corrupt the database. Bear in mind, these corruptions would be in the content of the data and would not yield a MySQL error. They would be phenomena such as phantom rows or inconsistent data updates. As I said previously, the only real solution to deadlocks is to fix the way the system manages transactions and locks. This patch is merely hiding an error while creating the potential for far larger problems. For these reasons, I remain -1 on merging this patch.
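To make the requirement above concrete ("all data being updated must be re-queried and validated"), here is a minimal sketch of what retrying a whole unit of work would have to look like. It is not the code in this PR, which is not shown in this thread, and it does not settle whether retrying is a good idea. The names `DeadlockRetrySketch` and `runWholeUnitWithRetry` are hypothetical. The sketch assumes that the pre-checks, reads and writes all live inside the supplied `unitOfWork`, that each call runs in a fresh transaction, and that the deadlock surfaces as an unchecked exception whose cause chain contains a `java.sql.SQLException` with SQLSTATE `40001`, which is the state MySQL reports for "Deadlock found when trying to get lock".

```java
import java.sql.SQLException;
import java.util.function.Supplier;

public class DeadlockRetrySketch {

    /** SQLSTATE reported by MySQL for a transaction rolled back due to deadlock. */
    static final String DEADLOCK_SQLSTATE = "40001";

    /** Walks the cause chain looking for an SQLException carrying the deadlock SQLSTATE. */
    static boolean causedByDeadlock(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SQLException
                    && DEADLOCK_SQLSTATE.equals(((SQLException) t).getSQLState())) {
                return true;
            }
        }
        return false;
    }

    /**
     * Re-runs the ENTIRE unit of work (validation included) when it fails
     * because of a deadlock. Any other failure is rethrown immediately.
     */
    static <T> T runWholeUnitWithRetry(Supplier<T> unitOfWork, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return unitOfWork.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !causedByDeadlock(e)) {
                    throw e;
                }
                // Deadlock: loop again so every check and query is repeated.
            }
        }
    }
}
```

The design point is the boundary: anything validated outside `unitOfWork` is not re-validated on retry, which is the situation the comments above warn about.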
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @jburwell We've been running this fix as a part of proprietary CS for several weeks now. We are observing elimination of deadlocks and no DB corruption. Retry seems to be the only realistic way of dealing with deadlocks in a complex environment like ACS. Can we come up with a limited scope or set of conditions for this PR so it can move forward?
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rhtyd I am -1 on this PR
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user rhtyd commented on the issue: https://github.com/apache/cloudstack/pull/1762 @abhinandanprateek can you help review this one? Thanks.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user blueorangutan commented on the issue: https://github.com/apache/cloudstack/pull/1762

Trillian test result (tid-347)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 26094 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr1762-t347-kvm-centos7.zip
Test completed. 47 look ok, 1 have error(s)

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | `Failure` | 369.46 | test_vpc_redundant.py
test_01_vpc_site2site_vpn | Success | 154.87 | test_vpc_vpn.py
test_01_vpc_remote_access_vpn | Success | 66.24 | test_vpc_vpn.py
test_01_redundant_vpc_site2site_vpn | Success | 255.75 | test_vpc_vpn.py
test_02_VPC_default_routes | Success | 273.12 | test_vpc_router_nics.py
test_01_VPC_nics_after_destroy | Success | 534.12 | test_vpc_router_nics.py
test_05_rvpc_multi_tiers | Success | 513.09 | test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics | Success | 1408.56 | test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Success | 553.46 | test_vpc_redundant.py
test_02_redundant_VPC_default_routes | Success | 753.12 | test_vpc_redundant.py
test_09_delete_detached_volume | Success | 15.44 | test_volumes.py
test_08_resize_volume | Success | 15.36 | test_volumes.py
test_07_resize_fail | Success | 20.45 | test_volumes.py
test_06_download_detached_volume | Success | 15.29 | test_volumes.py
test_05_detach_volume | Success | 100.25 | test_volumes.py
test_04_delete_attached_volume | Success | 10.18 | test_volumes.py
test_03_download_attached_volume | Success | 15.30 | test_volumes.py
test_02_attach_volume | Success | 73.79 | test_volumes.py
test_01_create_volume | Success | 712.21 | test_volumes.py
test_deploy_vm_multiple | Success | 278.61 | test_vm_life_cycle.py
test_deploy_vm | Success | 0.03 | test_vm_life_cycle.py
test_advZoneVirtualRouter | Success | 0.02 | test_vm_life_cycle.py
test_10_attachAndDetach_iso | Success | 26.47 | test_vm_life_cycle.py
test_09_expunge_vm | Success | 125.19 | test_vm_life_cycle.py
test_08_migrate_vm | Success | 35.86 | test_vm_life_cycle.py
test_07_restore_vm | Success | 0.10 | test_vm_life_cycle.py
test_06_destroy_vm | Success | 125.83 | test_vm_life_cycle.py
test_03_reboot_vm | Success | 125.82 | test_vm_life_cycle.py
test_02_start_vm | Success | 10.16 | test_vm_life_cycle.py
test_01_stop_vm | Success | 40.30 | test_vm_life_cycle.py
test_CreateTemplateWithDuplicateName | Success | 75.58 | test_templates.py
test_08_list_system_templates | Success | 0.04 | test_templates.py
test_07_list_public_templates | Success | 0.04 | test_templates.py
test_05_template_permissions | Success | 0.06 | test_templates.py
test_04_extract_template | Success | 5.18 | test_templates.py
test_03_delete_template | Success | 5.10 | test_templates.py
test_02_edit_template | Success | 90.17 | test_templates.py
test_01_create_template | Success | 70.57 | test_templates.py
test_10_destroy_cpvm | Success | 131.62 | test_ssvm.py
test_09_destroy_ssvm | Success | 163.18 | test_ssvm.py
test_08_reboot_cpvm | Success | 101.42 | test_ssvm.py
test_07_reboot_ssvm | Success | 133.53 | test_ssvm.py
test_06_stop_cpvm | Success | 166.54 | test_ssvm.py
test_05_stop_ssvm | Success | 133.56 | test_ssvm.py
test_04_cpvm_internals | Success | 0.95 | test_ssvm.py
test_03_ssvm_internals | Success | 3.27 | test_ssvm.py
test_02_list_cpvm_vm | Success | 0.11 | test_ssvm.py
test_01_list_sec_storage_vm | Success | 0.12 | test_ssvm.py
test_01_snapshot_root_disk | Success | 16.20 | test_snapshots.py
test_04_change_offering_small | Success | 209.44 | test_service_offerings.py
test_03_delete_service_offering | Success | 0.03 | test_service_offerings.py
test_02_edit_service_offering | Success | 0.05 | test_service_offerings.py
test_01_create_service_offering | Success | 0.10 | test_service_offerings.py
test_02_sys_template_ready | Success | 0.12 | test_secondary_storage.py
test_01_sys_vm_start | Success | 0.17 | test_secondary_storage.py
test_09_reboot_router | Success | 30.26 | test_routers.py
test_08_start_router | Success | 25.25 | test_routers.py
test_07_stop_router | Success | 10.15 | test_routers.py
test_06_router_advanced | Success | 0.05 | test_routers.py
test_05_router_basic | Success | 0.04 | test_routers.py
test_04_restart_network_wo_cleanup | Success | 5.59 | test_routers.py
test_03_restart_network_cleanup | Success | 50.47 | test_routers.py
test_02_router_internal_adv | Success | 1.09 | test_routers.py
test_01_router_internal_basic | Success | 0.58 | test_routers.py
test_router_dns_guestipquery | Success | 76.67 | test_router_dns.py
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user blueorangutan commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user rhtyd commented on the issue: https://github.com/apache/cloudstack/pull/1762 @blueorangutan test
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user blueorangutan commented on the issue: https://github.com/apache/cloudstack/pull/1762 Packaging result: centos6 centos7 debian. JID-164
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user blueorangutan commented on the issue: https://github.com/apache/cloudstack/pull/1762 @rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user rhtyd commented on the issue: https://github.com/apache/cloudstack/pull/1762 @blueorangutan package
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 Due to the previous discussion, I am -1 on merging this PR.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user cloudmonger commented on the issue: https://github.com/apache/cloudstack/pull/1762

### ACS CI BVT Run

**Summary:**
Build Number 135
Hypervisor xenserver
NetworkType Advanced
Passed=102 Failed=3 Skipped=6

_Link to logs Folder (search by build_no):_ https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

**Failed tests:**
* test_non_contigiousvlan.py
  * test_extendPhysicalNetworkVlan Failed
* test_deploy_vm_iso.py
  * test_deploy_vm_from_iso Failing since 19 runs
* test_vm_life_cycle.py
  * test_10_attachAndDetach_iso Failing since 20 runs

**Skipped tests:**
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

**Passed test suites:**
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_snapshots.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_login.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_routers_network_ops.py
test_disk_offerings.py
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 with custom plugins, there is no way to reliably perform such tracing. I can think of batch cleanup operations in the storage layer that follow the pattern I described. Even if there were, we would have planted a landmine for future changes to the system. Deadlocks are significant technical debt that is clearly causing significant operational issues. Unfortunately, there is no way to address them generically.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @jburwell I concur, but if @yvsubhash verified that those methods don't participate in complex DML transactions, this might still be a good start. If so, this approach might be expanded later to multi-DML transactions so that each piece can be retried individually. I myself traced a few deadlocks in ACS using native MySQL deadlock logging, and there doesn't seem to be a viable alternative to retries, given the well-known complexity of ACS DB operations.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 there remains a risk when those methods are executed in the context of an open transaction where DMLs have already been executed and subsequent DMLs will be executed. In this scenario, the first set of the changes would be lost due to the rollback triggered by the query deadlock with the second set proceeding successfully.
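The scenario described above can be illustrated with a toy in-memory model. This is only a sketch: `LostUpdateDemo` and all of its methods are hypothetical names invented for illustration, not CloudStack or JDBC code, and the model assumes (as stated elsewhere in this thread) that a deadlock rolls back the whole enclosing transaction.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not CloudStack code) of one connection with an open transaction. */
final class LostUpdateDemo {
    final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean deadlockNext;

    /** Arranges for the next statement to be chosen as the deadlock victim. */
    void failNextStatementWithDeadlock() {
        deadlockNext = true;
    }

    /** Executes one DML; a simulated deadlock discards every pending change in the transaction. */
    boolean execute(String dml) {
        if (deadlockNext) {
            deadlockNext = false;
            pending.clear(); // the whole transaction is rolled back, not just this statement
            return false;
        }
        pending.add(dml);
        return true;
    }

    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** Statement-level retry: re-runs only the statement that failed. */
    void executeWithStatementRetry(String dml) {
        if (!execute(dml)) {
            execute(dml);
        }
    }

    /** First DML succeeds, second deadlocks and is retried alone, then the caller commits. */
    static List<String> scenario() {
        LostUpdateDemo db = new LostUpdateDemo();
        db.execute("UPDATE a");
        db.failNextStatementWithDeadlock();
        db.executeWithStatementRetry("UPDATE b");
        db.commit();
        return db.committed;
    }
}
```

In this model `LostUpdateDemo.scenario()` commits only `UPDATE b`: the earlier `UPDATE a` is silently dropped and no error reaches the caller.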
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @jburwell @yvsubhash I might be wrong, but this PR will retry on deadlock for only 2 DAO methods: searchIncludingRemoved and customSearchIncludingRemoved. No update methods are set up with this retry mechanism. If that's the case, there is no risk of corrupting the DB.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 that is not a safe assumption. Transactions often span multiple statements and methods across DAOs. `TransactionLegacy` has a transaction stacking/nested model that further occludes when a transaction actually completes. Deadlocks are a severe problem that needs to be fixed. Unfortunately, this patch would do more harm than good as it would eventually corrupt the database. In and of themselves, retries are also a very expensive solution to the problem, both in terms of the engineering effort required to do it properly and the extra stress placed on the database to perform additional work that will likely fail. Furthermore, a generic **and** correct retry mechanism is a very difficult thing to write. Given the way transaction boundaries are managed in ACS, I think such an effort would be nearly impossible. In a properly written application, deadlocks should very rarely, if ever, occur. Their presence is a symptom of improper transaction handling and/or poor lock management. Therefore, my suggestion is that we change this patch to log details about the context in which deadlocks occur. We can then use this information to identify the areas in ACS where these contention problems are located and fix the root cause.
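A minimal sketch of the logging alternative suggested above, under stated assumptions: `DeadlockContextLogger`, its method names, and the `txnDepth` parameter are hypothetical and do not come from the patch. It uses MySQL's documented deadlock error code 1213 and `java.util.logging`; CloudStack's own logging setup may differ.

```java
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: records where a deadlock happened and then lets the error propagate. */
final class DeadlockContextLogger {
    private static final Logger LOG = Logger.getLogger(DeadlockContextLogger.class.getName());

    // MySQL error 1213 is ER_LOCK_DEADLOCK (SQLSTATE 40001).
    static final int MYSQL_DEADLOCK_ERROR_CODE = 1213;

    /** Builds the diagnostic message; returns null when the exception is not a deadlock. */
    static String describe(SQLException e, String sql, int txnDepth) {
        if (e.getErrorCode() != MYSQL_DEADLOCK_ERROR_CODE) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        sb.append("Deadlock on thread ").append(Thread.currentThread().getName())
          .append(", transaction depth ").append(txnDepth)
          .append(", statement: ").append(sql).append('\n');
        // The stack trace shows which code path held the transaction open.
        for (StackTraceElement frame : e.getStackTrace()) {
            sb.append("  at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    /** Logs deadlock context if applicable, then always rethrows so the caller fails and rolls back. */
    static void logAndRethrow(SQLException e, String sql, int txnDepth) throws SQLException {
        String message = describe(e, sql, txnDepth);
        if (message != null) {
            LOG.log(Level.WARNING, message);
        }
        throw e;
    }
}
```

The helper never retries; it only captures the thread, the assumed nesting depth, and the statement so that contention hot spots can be found and fixed at the source.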
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @jburwell I thought that most, if not all, ACS interactions through DAOs are atomic transactions. Do we have cases of multiple DML statements as part of the same transaction? We have been seeing quite a few deadlocks in high-transaction-volume environments where multiple management servers are employed. This causes quite a pain for users due to the randomness and the lack of good recourse or explanation. I would argue that proper retry is a better choice, provided we cover all the cases, including those with complex transactions. We have been successful leveraging this approach in systems built on top of ACS.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 @serg38 my reading of the code is that only the most recently attempted DML will be re-executed. Furthermore, retrying without refreshing the base data can also lead to data corruption. The best thing to do in the case of a deadlock is to fail and roll back, due to the risk of data corruption.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 @jburwell @yvsubhash My understanding is that all rolled-back statements will receive MYSQL_DEADLOCK_ERROR_CODE and will be retried as a part of this patch.
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1762 @yvsubhash according to the [MySQL deadlock documentation](http://dev.mysql.com/doc/refman/5.7/en/innodb-deadlocks.html), a `MYSQL_DEADLOCK_ERROR_CODE` error indicates the enclosing transaction has been rolled back. The proper handling for this error is to re-execute all statements executed in the aborted transaction. From a best practices perspective, all base data should be re-retrieved and changed to ensure logical consistency with changes made by the transaction that won deadlock resolution. As I understand this patch, only the most recently executed DML is retried. Therefore, any previously executed changes will be discarded and the DML will be re-executed either in a new transaction or in auto-commit (I didn't look up how the client handles the transaction context in this scenario). If my understanding is correct, this patch could lead to issues ranging from unexpected foreign key integrity errors to data corruption. Rather than attempting to implement a generic retry, I think the best approach to addressing deadlocks is to treat them as bugs. This patch could be modified to provide detailed logging information about the conditions under which a deadlock occurs, providing the information necessary to refactor the system to avoid lock contention.
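For contrast with a statement-level retry, the handling described above (re-execute every statement of the aborted transaction, after re-reading the base data) might look like the sketch below. `WholeTransactionRetry` and its methods are hypothetical names, not the patch's code or a CloudStack API. It assumes the caller owns the complete transaction boundary, which this thread argues is hard to guarantee in ACS.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of CloudStack: retries a whole unit of work after a deadlock. */
final class WholeTransactionRetry {
    // MySQL error 1213 is ER_LOCK_DEADLOCK (SQLSTATE 40001).
    static final int MYSQL_DEADLOCK_ERROR_CODE = 1213;

    static boolean isDeadlock(SQLException e) {
        return e.getErrorCode() == MYSQL_DEADLOCK_ERROR_CODE;
    }

    /**
     * Runs unitOfWork, and runs it again from the start if it fails with a deadlock.
     * The unit of work is expected to begin its own transaction, re-read any base data,
     * execute all of its statements, and commit, so each attempt starts from fresh state.
     */
    static <T> T run(Callable<T> unitOfWork, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return unitOfWork.call();
            } catch (SQLException e) {
                // Anything other than a deadlock, or a deadlock on the last attempt, is rethrown.
                if (!isDeadlock(e) || attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

The retry wraps the whole `Callable` rather than a single DAO call, so nothing executed before the deadlock is carried over from the rolled-back attempt. A production version would likely add a backoff between attempts and the detailed logging proposed in this comment.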
[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...
Github user serg38 commented on the issue: https://github.com/apache/cloudstack/pull/1762 LGTM. Finally!!! We have been seeing occasional deadlocks in environments with a high transaction rate. @rhtyd @jburwell This could be a good addition to 4.8/4.9.