[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2017-04-14 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@yvsubhash Please, take this up. So far this PR hasn't moved forward.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2017-04-14 Thread yvsubhash
Github user yvsubhash commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 Is the refactoring suggested by Rafael being taken care of by 
@nvazquez? If not, I will take it up.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2017-03-13 Thread cloudmonger
Github user cloudmonger commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
### ACS CI BVT Run
 **Summary:**
 Build Number 464
 Hypervisor xenserver
 NetworkType Advanced
 Passed=104
 Failed=1
 Skipped=7

_Link to logs Folder (search by build_no):_ 
https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0


**Failed tests:**
* test_routers_network_ops.py

 * test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failed


**Skipped tests:**
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

**Passed test suites:**
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-30 Thread rafaelweingartner
Github user rafaelweingartner commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 I have the same understanding about the agent LB, and this is one 
of the problems I think we have found here. It seems that this method 
undoes the balance created by the agent LB. And, of course, this method is 
also causing deadlocks.

Let’s hear the feedback from others and discuss how we can move forward.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-29 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rafaelweingartner Thanks a lot. I totally agree that resetting hosts 
doesn't really need to be part of the transaction and should be extracted to 
a new method. The same goes for lines 527-546, and for the block after line 551.
My understanding of the agent LB is that it is handled separately from the 
reconnect part. I might be wrong, but it is done in ClusteredAgentManagerImpl 
by scheduling a rebalancing task every 60 seconds (getAgentRebalanceScanTask), 
which takes care of transferring connected agents.
@rhtyd @jburwell @koushik-das @karuturi Do you agree that we can split the 
transaction in findAndUpdateDirectAgentToLoad into 3 non-transactional methods 
and thus eliminate one side of a recurring deadlock? This is the very core of 
agent management, for which it is very hard, if at all possible, to write a 
smoke test. If so, @nvazquez might be able to work on refactoring this method 
later this month.





[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-27 Thread rafaelweingartner
Github user rafaelweingartner commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38, it is great that you found one of the methods that cause the 
deadlock problem 
“com.cloud.host.dao.HostDaoImpl.findAndUpdateDirectAgentToLoad(long, Long, 
long)”.

This method surely is problematic. I would start by asking: (i) does it 
need to manually open a transaction (at line 512)? Isn’t that the goal of the 
“@DB” annotation? (ii) what is the objective of the method 
(“findAndUpdateDirectAgentToLoad”)? It looks too complicated, with too 
many accesses to the DB.

The method “resetHosts” at line 517 looks for hosts that are 
“managed” by the current MS and are “Disconnected” to mark them as 
unmanaged by any MS. That is, it sets “managementServerId = null” for 
hosts marked as “Disconnected”.

Would it not be better to have a specific method/transaction only for the 
aforementioned process? If we extract that chunk of code to an isolated 
method, could we not have an atomic access to the DB without locking? “update 
hosts set managementServerId = null where ……”; if the method is 
isolated, I do not see reasons for locks here.
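
As a rough illustration of the single-statement idea above, here is a minimal Java sketch. The class `HostResetSketch` and its method are hypothetical and not part of ACS. The table and column names are borrowed from the `host` SELECT that appears in the management server log quoted elsewhere on this thread; whether “resetHosts” applies exactly these predicates is an assumption. Note that InnoDB still takes row locks while the UPDATE runs, but they are not held across further queries when the statement is committed on its own.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper for illustration only; not ACS code. */
final class HostResetSketch {
    // Predicates mirror the SELECT ... FOR UPDATE seen in the log on this
    // thread. That resetHosts() uses the same conditions is an assumption.
    static final String RESET_SQL =
        "UPDATE host SET mgmt_server_id = NULL "
      + "WHERE resource IS NOT NULL AND mgmt_server_id = ? "
      + "AND last_ping <= ? AND cluster_id IS NOT NULL "
      + "AND status IN ('Disconnected','Down','Alert') "
      + "AND removed IS NULL";

    /** Runs the reset as one statement and returns the number of rows changed. */
    static int resetHosts(Connection conn, long msId, long lastPingCutoff)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(RESET_SQL)) {
            ps.setLong(1, msId);
            ps.setLong(2, lastPingCutoff);
            return ps.executeUpdate();
        }
    }
}
```

With auto-commit enabled on the connection, the call commits by itself, so no surrounding transaction is needed for this step.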

A little further, there is another block that could be isolated, lines 
527 – 546. This block of code looks for clusters being managed by the current 
MS. Then, it searches for hosts in those clusters that are not being managed 
by the current MS (or not managed at all?). I did not understand that, because 
I have seen in some other piece of code that we have a balancing approach; 
that is, we try to balance the number of hosts managed by each MS. This piece 
of code seems to undo the balancing process.

Then, at line 551 and forward (if the number of hosts is less than the 
limit), it tries to look for hosts of clusters not being managed by any MS. 
This block could also be an isolated one. And again, we might be able to do 
this process without using locks.

My final comment: even if we choose not to refactor and improve this piece 
of code, there is one thing that is very strange to me. The method 
“findAndUpdateDirectAgentToLoad” is annotated with “@DB”, and also 
opens and tries to manage a transaction manually. Then, all of the pieces of 
code I mentioned call other methods that are also annotated with “@DB”. 
Can this cause a problem?

For instance, when I use Spring and methods from a service layer (the place 
where I configure my pattern of transactions) call one another, they all 
share the same transaction opened when the first method of the service 
layer was called, unless specified otherwise. How will it work here in ACS?
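
To make the Spring behaviour described above concrete, here is a toy Java model of "join the existing transaction" propagation (Spring's default, REQUIRED). It uses neither Spring nor ACS classes; `PropagationSketch` and `inTransaction` are hypothetical names. It says nothing about how “@DB” actually behaves in ACS, which is the open question here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of join-if-present propagation; not Spring or ACS code. */
final class PropagationSketch {
    // Nesting depth of inTransaction() calls on the current thread.
    private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);
    // Event log, kept only so the behaviour can be inspected.
    static final List<String> EVENTS = new ArrayList<>();

    /** Opens a transaction only if none is active; otherwise joins it. */
    static void inTransaction(String name, Runnable work) {
        int depth = DEPTH.get();
        EVENTS.add(depth == 0 ? "begin by " + name : name + " joins");
        DEPTH.set(depth + 1);
        boolean ok = false;
        try {
            work.run();
            ok = true;
        } finally {
            DEPTH.set(depth);
            // Only the outermost caller ends the transaction.
            if (depth == 0) {
                EVENTS.add((ok ? "commit by " : "rollback by ") + name);
            }
        }
    }
}
```

Calling `inTransaction("outer", () -> inTransaction("inner", () -> {}))` records one begin, one join and one commit, so the inner call never commits or releases locks on its own.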





[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-25 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rafaelweingartner I might be wrong, but the second one came from 
findAndUpdateDirectAgentToLoad in HostDaoImpl, which also creates a large 
transaction.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-25 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rafaelweingartner You might be right that pod_vlan_map should be in the 
join. Maybe I didn't find the correct methods after all. @jburwell @rhtyd What 
do you think?

I was able to find the management server log for Deadlock 1. It looks like 
one of the transactions came from the findAndUpdateDirectAgentToLoad method in 
HostDaoImpl, which creates a rather complex transaction:

2016-11-24 15:04:39,284 DEBUG [host.dao.HostDaoImpl] (ClusteredAgentManager 
Timer:ctx-a8e9449c) Resetting hosts suitable for reconnect
2016-11-24 15:04:39,320 DEBUG [db.Transaction.Transaction] 
(ClusteredAgentManager Timer:ctx-a8e9449c) Rolling back the transaction: Time = 
36 Name =  ClusteredAgentManager Timer; called by 
-TransactionLegacy.rollback:879-TransactionLegacy.removeUpTo:822-TransactionLegacy.close:646-TransactionContextInterceptor.invoke:36-ReflectiveMethodInvocation.proceed:161-ExposeInvocationInterceptor.invoke:91-ReflectiveMethodInvocation.proceed:172-JdkDynamicAopProxy.invoke:204-$Proxy48.findAndUpdateDirectAgentToLoad:-1-ClusteredAgentManagerImpl.scanDirectAgentToLoad:195-ClusteredAgentManagerImpl.runDirectAgentScanTimerTask:185-ClusteredAgentManagerImpl.access$100:99
2016-11-24 15:04:39,322 ERROR [agent.manager.ClusteredAgentManagerImpl] 
(ClusteredAgentManager Timer:ctx-a8e9449c) Unexpected exception DB Exception 
on: com.mysql.jdbc.JDBC4PreparedStatement@1e58727c: SELECT host.id, 
host.disconnected, host.name, host.status, host.type, host.private_ip_address, 
host.private_mac_address, host.private_netmask, host.public_netmask, 
host.public_ip_address, host.public_mac_address, host.storage_ip_address, 
host.cluster_id, host.storage_netmask, host.storage_mac_address, 
host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, 
host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, 
host.available, host.setup, host.resource_state, host.hypervisor_version, 
host.update_count, host.uuid, host.data_center_id, host.pod_id, 
host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, 
host.guid, host.capabilities, host.total_size, host.last_ping, 
host.mgmt_server_id, host.dom0_memory, host.version, host.created, 
host.removed FROM host WHERE host.resource IS NOT NULL  AND host.mgmt_server_id 
= 345048964870  AND host.last_ping <= 1445339907  AND host.cluster_id IS NOT 
NULL  AND host.status IN ('Disconnected','Down','Alert')  AND host.removed IS 
NULL  FOR UPDATE 
Caused by: 
com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock 
found when trying to get lock; try restarting transaction

Beginning of second transaction was 
SELECT host.id, host.disconnected, host.name, host.status, host.type, 
host.private_ip_address, host.private_mac_address, host.private_netmask, 
host.public_netmask, host.public_ip_address, host.public_mac_address, 
host.storage_ip_address, host.cluster_id, host.storage_netmask, 
host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, 
host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, 
host.resource, host.fs_type, host.available, host.setup, host.resource_state, 
host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, 
host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, 
host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, 
host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed 
FROM host  LEFT OUTER JOIN op_host_transfer ON host.id=op_host_transfer.id  IN

I will try to trace it to the ACS method.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-25 Thread rafaelweingartner
Github user rafaelweingartner commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 if that "AssignIpAddressFromPodVlanSearch" object was being used to 
generate the SQL, should we not see a join with "pod_vlan_map" too? For me, 
this SC is very confusing.

Following the same idea of what I would do if using Spring to manage 
transactions, the method "fetchNewPublicIp" does not need the "@DB" annotation 
(assuming this is the annotation that opens a transaction and locks tables in 
ACS). The method “fetchNewPublicIp” is a simple "retrieve/get" method. 
Whenever we have to lock the table that is being used by this method, we could 
use the "fetchNewPublicIp" in a method that has the "@DB" annotation (assuming 
it has transaction propagation). This is something that already seems to 
happen. Methods "allocateIp" and "assignDedicateIpAddress" use 
“fetchNewPublicIp” and they have their own “@DB” annotation.

Methods “assignPublicIpAddressFromVlans” and 
“assignPublicIpAddress” seem not to do anything that requires a 
transaction; despite names that are misleading (at least for me), suggesting 
that something will be assigned to someone, they just call 
“fetchNewPublicIp” and return its response. Therefore, I do not think they 
require a locking transaction.





[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-24 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rafaelweingartner I tried tracing where deadlock 5 originated. It seems 
both transactions are part of the same method, fetchNewPublicIp in 
IpAddressManagerImpl. The transactions are executed on different management 
servers. 
The update is triggered through the markPublicIpAsAllocated method.

The select seems to come from there as well: fetchNewPublicIp in 
IpAddressManagerImpl

    AssignIpAddressFromPodVlanSearch = _ipAddressDao.createSearchBuilder();
    AssignIpAddressFromPodVlanSearch.and("dc",
            AssignIpAddressFromPodVlanSearch.entity().getDataCenterId(), Op.EQ);
    AssignIpAddressFromPodVlanSearch.and("allocated",
            AssignIpAddressFromPodVlanSearch.entity().getAllocatedTime(), Op.NULL);
    SearchBuilder podVlanSearch = _vlanDao.createSearchBuilder();
    podVlanSearch.and("type", podVlanSearch.entity().getVlanType(), Op.EQ);
    podVlanSearch.and("networkId", podVlanSearch.entity().getNetworkId(), Op.EQ);
    SearchBuilder podVlanMapSB = _podVlanMapDao.createSearchBuilder();
    podVlanMapSB.and("podId", podVlanMapSB.entity().getPodId(), Op.EQ);
    AssignIpAddressFromPodVlanSearch.join("podVlanMapSB", podVlanMapSB,
            podVlanMapSB.entity().getVlanDbId(),
            AssignIpAddressFromPodVlanSearch.entity().getVlanId(), JoinType.INNER);
    AssignIpAddressFromPodVlanSearch.join("vlan", podVlanSearch,
            podVlanSearch.entity().getId(),
            AssignIpAddressFromPodVlanSearch.entity().getVlanId(), JoinType.INNER);
    AssignIpAddressFromPodVlanSearch.done();

    public IPAddressVO doInTransaction(TransactionStatus status)
            throws InsufficientAddressCapacityException {
        StringBuilder errorMessage = new StringBuilder("Unable to get ip adress in ");
        boolean fetchFromDedicatedRange = false;
        List dedicatedVlanDbIds = new ArrayList();
        List nonDedicatedVlanDbIds = new ArrayList();

        SearchCriteria sc = null;
        if (podId != null) {
            sc = **AssignIpAddressFromPodVlanSearch**.create();
            sc.setJoinParameters("podVlanMapSB", "podId", podId);
            errorMessage.append(" pod id=" + podId);
        } else {
            sc = AssignIpAddressSearch.create();
            errorMessage.append(" zone id=" + dcId);
        }





[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-24 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rafaelweingartner It looks like deadlocks 2 and 3 are the same. I 
scanned our production log, and since last December we have had 6400 
deadlocks. Of those, close to 6000 were Deadlock 1, 20 were Deadlock 2, and 
700 were a different one, Deadlock 5. The other deadlocks occurred in 
negligible numbers. I think if we figure out Deadlock 1 and Deadlock 5, this 
will be a good start. I will try to find the source of the transactions for 
them. In production we run a commercial distribution based for the most part 
on the 4.7 branch of ACS. 

Deadlock 5

*** (1) TRANSACTION:
TRANSACTION D518886F8, ACTIVE 2 sec fetching rows
mysql tables in use 4, locked 4
LOCK WAIT 24 lock struct(s), heap size 3112, 8 row lock(s), undo log 
entries 17
MySQL thread id 29781, OS thread handle 0x7f9df36db700, query id 3625404021 
ussclpdcsmgt012.autodesk.com 10.41.13.14 cloud Sorting result
SELECT user_ip_address.id, user_ip_address.account_id, 
user_ip_address.domain_id, user_ip_address.public_ip_address, 
user_ip_address.data_center_id, user_ip_address.source_nat, 
user_ip_address.allocated, user_ip_address.vlan_db_id, 
user_ip_address.one_to_one_nat, user_ip_address.vm_id, user_ip_address.state, 
user_ip_address.mac_address, user_ip_address.source_network_id, 
user_ip_address.network_id, user_ip_address.uuid, 
user_ip_address.physical_network_id, user_ip_address.is_system, 
user_ip_address.vpc_id, user_ip_address.dnat_vmip, 
user_ip_address.is_portable, user_ip_address.display, 
user_ip_address.removed, user_ip_address.created FROM user_ip_address  INNER 
JOIN vlan ON user_ip_address.vlan_db_id=vlan.id WHERE 
user_ip_address.data_center_id = 6  AND user_ip_address.allocated IS NULL  AND 
user_ip_address.vlan_db_id IN 
(32,33,36,37,41,61,62,91,92,93,94,106,107,108,109,11
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
*** (2) TRANSACTION:
TRANSACTION D5188582B, ACTIVE 17 sec updating or deleting, thread declared 
inside InnoDB 499
mysql tables in use 1, locked 1
25 lock struct(s), heap size 3112, 13 row lock(s), undo log entries 18
MySQL thread id 29820, OS thread handle 0x7fa35a868700, query id 3625417999 
ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Updating
UPDATE user_ip_address SET user_ip_address.source_nat=0, 
user_ip_address.is_system=0, user_ip_address.account_id=3309, 
user_ip_address.allocated='2016-03-25 15:36:39', 
user_ip_address.state='Allocated', user_ip_address.domain_id=335 WHERE 
user_ip_address.id = 3284
*** (2) HOLDS THE LOCK(S):





[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-24 Thread rafaelweingartner
Github user rafaelweingartner commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
Thanks, @serg38.
Looking at the SQL statements you posted, we could start to discuss whether 
or not some of them need locking transactions.

Ignoring Deadlocks 3 and 4 for now, I think we could start with the ones 
that look the simplest (Deadlocks 1 and 2). 

These SQL statements have probably been generated, so tracking them in ACS 
may not be that easy, but at first glance, I feel that we could execute them 
without needing a lock in the database. 

I tried to find the first SQL statement, without success. Would you mind 
helping me pinpoint where in the code the SQL from transaction 1 of deadlock 1 
is generated? Then, we can evaluate whether or not a lock is needed there. 

Are the SQL statements you showed complete? I found a place that could 
generate SQL similar to the one in transaction 1 of deadlock 1, but this code 
adds one extra where clause.

The method I am talking about is:

com.cloud.cluster.agentlb.ClusterBasedAgentLoadBalancerPlanner.getHostsToRebalance(long, int)





[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-24 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
Here are a few samples of the deadlocks we observe in a high-transaction-volume 
environment with multiple management servers. As you can see, most of them 
involve concurrent operations from different management servers and are either 
select or select for update statements. The following 4 types account for the 
majority of the deadlocks we have seen so far (80-90% of all deadlocks). 
Deadlocks 1-3 happen much more often than deadlock 4. They are next to 
impossible to reproduce, since they occur once every few days with 4 
management servers and an average VM deployment volume of 3000 a day.

Deadlock type 1:

InnoDB: transactions deadlock detected, dumping detailed information.
151217  3:08:20
*** (1) TRANSACTION:
TRANSACTION BB4D4C91D, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 11 lock struct(s), heap size 3112, 5 row lock(s)
MySQL thread id 47654, OS thread handle 0x7f0475bdd700, query id 3821358107 
ussclpdcsmgt012.autodesk.com 10.41.13.14 cloud Sending data
SELECT host.id, host.disconnected, host.name, host.status, host.type, 
host.private_ip_address, host.private_mac_address, host.private_netmask, 
host.public_netmask, host.public_ip_address, host.public_mac_address, 
host.storage_ip_address, host.cluster_id, host.storage_netmask, 
host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, 
host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, 
host.resource, host.fs_type, host.available, host.setup, host.resource_state, 
host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, 
host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, 
host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, 
host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed 
FROM host WHERE host.resource IS NOT NULL  AND host.mgmt_server_id = 
345048964870 
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
*** (2) TRANSACTION:
TRANSACTION BB4D4C915, ACTIVE 1 sec fetching rows, thread declared inside 
InnoDB 449
mysql tables in use 3, locked 3
29 lock struct(s), heap size 6960, 15 row lock(s), undo log entries 1
MySQL thread id 47623, OS thread handle 0x7f0a47074700, query id 3821724056 
ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Copying to tmp table
SELECT host.id, host.disconnected, host.name, host.status, host.type, 
host.private_ip_address, host.private_mac_address, host.private_netmask, 
host.public_netmask, host.public_ip_address, host.public_mac_address, 
host.storage_ip_address, host.cluster_id, host.storage_netmask, 
host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, 
host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, 
host.resource, host.fs_type, host.available, host.setup, host.resource_state, 
host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, 
host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, 
host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, 
host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed 
FROM host  LEFT OUTER JOIN op_host_transfer ON host.id=op_host_transfer.id
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 147488 n bits 840 index `i_host__removed` 
of table `cloud`.`host` trx id BB4D4C915 lock_mode X locks rec but not gap


Deadlock 2:

InnoDB: transactions deadlock detected, dumping detailed information.
151218 11:03:00
*** (1) TRANSACTION:
TRANSACTION BBB232C81, ACTIVE 51 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1248, 2 row lock(s)
MySQL thread id 57308, OS thread handle 0x7f0a45c24700, query id 5217973695 
ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Sending data
SELECT resource_count.id, resource_count.type, resource_count.account_id, 
resource_count.domain_id, resource_count.count FROM resource_count WHERE 
resource_count.id IN (5083,4867,5079,33652,5077)  FOR UPDATE
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
*** (2) TRANSACTION:
TRANSACTION BBB2254AC, ACTIVE 116 sec starting index read, thread declared 
inside InnoDB 500
mysql tables in use 1, locked 1
207 lock struct(s), heap size 31160, 1650 row lock(s), undo log entries 2
MySQL thread id 56926, OS thread handle 0x7f04756c9700, query id 5218549710 
ussclpdcsmgt014.autodesk.com 10.41.13.16 cloud Sending data
SELECT resource_count.id, resource_count.type, resource_count.account_id, 
resource_count.domain_id, resource_count.count FROM resource_count WHERE 
resource_count.id IN (5083,4867,5079,33652,5077)  FOR UPDATE

Deadlock 3:

*** (1) TRANSACTION:
TRANSACTION BBB232C81, ACTIVE 51 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock 

[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-24 Thread rafaelweingartner
Github user rafaelweingartner commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 I have just now started reading this PR (excuse me if I overlooked 
some information).

> If we are to try to implement a general way of dealing with deadlocks in 
ACS how could it be done to ensure DB consistency and correct transaction retry?

Answering your question: in my opinion, we should not “try” to 
implement a general way of managing transactions. We are only having this type 
of problem because, instead of using a framework to manage access and 
transactions in databases, a module was developed to do that and incorporated 
into ACS; this means we have to maintain and live with this code. 

Now, the problem is that it would be a Dantesque task to change the way ACS 
manages transactions today.

I am with John on this one, retrying is not a good idea; it can hide 
problems, cause overheads and cause even more headaches.  I think that the best 
approach is to deal with this type of problem on the fly; this means, as John 
said, addressing them as bugs when they are reported.

Having said that, I have not helped a bit to solve the problem… Let’s 
see if I can be of any help. 

I was reading the ticket #CLOUDSTACK-9595. It seems that the problem 
(reported there) happened when a VM was being removed from the table 
“instance_group_vm_map”. I just do not understand why, because the method 
called is “UserVmManagerImpl.addInstanceToGroup”. I am hoping that this 
makes sense. Anyways…

The MySQL docs have the following on deadlocks:
> A deadlock is a situation where different transactions are unable to 
proceed because each holds a lock that the other needs

This means, there was something else being executed when that VM was 
deleted/added, and this caused the deadlock and the exception. Probably 
something else is using the table “instance_group_vm_map”.
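
The definition quoted above can be illustrated with an in-JVM analogy: two workers that each need the same pair of locks can only deadlock if they take them in different orders. The Java sketch below is a hypothetical illustration (the class `OrderedLockSketch` is not ACS code) that always acquires locks in ascending key order. It models plain Java locks only; the order in which InnoDB locks rows depends on the access path the query uses, so this is not a statement about how MySQL behaves.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: one lock per id, always acquired in ascending id order. */
final class OrderedLockSketch {
    private static final Map<Long, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /** Locks every id in sorted order, runs the work, then unlocks in reverse. */
    static void withLocks(Iterable<Long> ids, Runnable work) {
        TreeSet<Long> ordered = new TreeSet<>();
        for (Long id : ids) {
            ordered.add(id);
        }
        Deque<ReentrantLock> held = new ArrayDeque<>();
        try {
            for (Long id : ordered) {
                ReentrantLock lock = LOCKS.computeIfAbsent(id, k -> new ReentrantLock());
                lock.lock();
                held.push(lock);
            }
            work.run();
        } finally {
            // Release in reverse acquisition order.
            while (!held.isEmpty()) {
                held.pop().unlock();
            }
        }
    }
}
```

Because every caller sorts the ids first, a caller asking for (5083, 4867) and another asking for (4867, 5083) both take 4867 before 5083, so neither can end up holding the lock the other is waiting for.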

I think we should track these two tasks/processes that can cause the 
problem and work them out, instead of looking for a generic way to deal with 
this situation. Maybe these processes that are causing deadlock are locking 
tables that are not needed or executing some processing that could be avoided 
or modified.

Do we have a use case that can reproduce the problem? 




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-24 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rafaelweingartner @swill @wido @koushik-das @karuturi @rhtyd @jburwell  
Let's ask a different question. If we are to try to implement a general way of 
dealing with deadlocks in ACS how could it be done to ensure DB consistency and 
correct transaction retry?




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-24 Thread abhinandanprateek
Github user abhinandanprateek commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
Even retrying the full transaction could be problematic, as there might 
be checks done before firing the transaction that may no longer be valid.
The thing is, it may mostly work, but it is not foolproof.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-23 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
What if the author can figure out a way to identify all parts of the 
transaction being cancelled and retry all of them? Or retry the whole 
transaction? It would be nice to open a path for the author to implement this.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-23 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 corruption could happen at any point -- it's a ticking time bomb.  
From an ACID perspective, this patch fails on consistency.  All 
data being updated must be re-queried and validated in order to ensure the 
consistency guarantee is not violated.  In a high volume system, it's not a 
matter of if, but when a sequence of events will occur and corrupt the 
database.   Bear in mind, these corruptions would be in the content of the data 
and would not yield a MySQL error.  They will be phenomena such as phantom rows 
or inconsistent data updates.

As I said previously, the only real solution to deadlocks is to fix the way 
the system manages transactions and locks.  This patch is merely hiding an 
error while creating the potential for far larger problems.

For these reasons, I remain -1 on merging this patch.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-23 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@jburwell We've been running this fix as part of a proprietary CS build for 
several weeks now. We are observing elimination of deadlocks and no DB 
corruption. Retry seems to be the only realistic way of dealing with deadlocks 
in a complex environment like ACS. Can we come up with a limited scope or set 
of conditions for this PR so it can move forward?




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-23 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rhtyd I am -1 on this PR




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-23 Thread rhtyd
Github user rhtyd commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@abhinandanprateek can you help review this one? Thanks.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-20 Thread blueorangutan
Github user blueorangutan commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
Trillian test result (tid-347)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 26094 seconds
Marvin logs: 
https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr1762-t347-kvm-centos7.zip
Test completed. 47 look ok, 1 have error(s)


Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | `Failure` | 369.46 | test_vpc_redundant.py
test_01_vpc_site2site_vpn | Success | 154.87 | test_vpc_vpn.py
test_01_vpc_remote_access_vpn | Success | 66.24 | test_vpc_vpn.py
test_01_redundant_vpc_site2site_vpn | Success | 255.75 | test_vpc_vpn.py
test_02_VPC_default_routes | Success | 273.12 | test_vpc_router_nics.py
test_01_VPC_nics_after_destroy | Success | 534.12 | test_vpc_router_nics.py
test_05_rvpc_multi_tiers | Success | 513.09 | test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics | Success | 1408.56 | test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Success | 553.46 | test_vpc_redundant.py
test_02_redundant_VPC_default_routes | Success | 753.12 | test_vpc_redundant.py
test_09_delete_detached_volume | Success | 15.44 | test_volumes.py
test_08_resize_volume | Success | 15.36 | test_volumes.py
test_07_resize_fail | Success | 20.45 | test_volumes.py
test_06_download_detached_volume | Success | 15.29 | test_volumes.py
test_05_detach_volume | Success | 100.25 | test_volumes.py
test_04_delete_attached_volume | Success | 10.18 | test_volumes.py
test_03_download_attached_volume | Success | 15.30 | test_volumes.py
test_02_attach_volume | Success | 73.79 | test_volumes.py
test_01_create_volume | Success | 712.21 | test_volumes.py
test_deploy_vm_multiple | Success | 278.61 | test_vm_life_cycle.py
test_deploy_vm | Success | 0.03 | test_vm_life_cycle.py
test_advZoneVirtualRouter | Success | 0.02 | test_vm_life_cycle.py
test_10_attachAndDetach_iso | Success | 26.47 | test_vm_life_cycle.py
test_09_expunge_vm | Success | 125.19 | test_vm_life_cycle.py
test_08_migrate_vm | Success | 35.86 | test_vm_life_cycle.py
test_07_restore_vm | Success | 0.10 | test_vm_life_cycle.py
test_06_destroy_vm | Success | 125.83 | test_vm_life_cycle.py
test_03_reboot_vm | Success | 125.82 | test_vm_life_cycle.py
test_02_start_vm | Success | 10.16 | test_vm_life_cycle.py
test_01_stop_vm | Success | 40.30 | test_vm_life_cycle.py
test_CreateTemplateWithDuplicateName | Success | 75.58 | test_templates.py
test_08_list_system_templates | Success | 0.04 | test_templates.py
test_07_list_public_templates | Success | 0.04 | test_templates.py
test_05_template_permissions | Success | 0.06 | test_templates.py
test_04_extract_template | Success | 5.18 | test_templates.py
test_03_delete_template | Success | 5.10 | test_templates.py
test_02_edit_template | Success | 90.17 | test_templates.py
test_01_create_template | Success | 70.57 | test_templates.py
test_10_destroy_cpvm | Success | 131.62 | test_ssvm.py
test_09_destroy_ssvm | Success | 163.18 | test_ssvm.py
test_08_reboot_cpvm | Success | 101.42 | test_ssvm.py
test_07_reboot_ssvm | Success | 133.53 | test_ssvm.py
test_06_stop_cpvm | Success | 166.54 | test_ssvm.py
test_05_stop_ssvm | Success | 133.56 | test_ssvm.py
test_04_cpvm_internals | Success | 0.95 | test_ssvm.py
test_03_ssvm_internals | Success | 3.27 | test_ssvm.py
test_02_list_cpvm_vm | Success | 0.11 | test_ssvm.py
test_01_list_sec_storage_vm | Success | 0.12 | test_ssvm.py
test_01_snapshot_root_disk | Success | 16.20 | test_snapshots.py
test_04_change_offering_small | Success | 209.44 | test_service_offerings.py
test_03_delete_service_offering | Success | 0.03 | test_service_offerings.py
test_02_edit_service_offering | Success | 0.05 | test_service_offerings.py
test_01_create_service_offering | Success | 0.10 | test_service_offerings.py
test_02_sys_template_ready | Success | 0.12 | test_secondary_storage.py
test_01_sys_vm_start | Success | 0.17 | test_secondary_storage.py
test_09_reboot_router | Success | 30.26 | test_routers.py
test_08_start_router | Success | 25.25 | test_routers.py
test_07_stop_router | Success | 10.15 | test_routers.py
test_06_router_advanced | Success | 0.05 | test_routers.py
test_05_router_basic | Success | 0.04 | test_routers.py
test_04_restart_network_wo_cleanup | Success | 5.59 | test_routers.py
test_03_restart_network_cleanup | Success | 50.47 | test_routers.py
test_02_router_internal_adv | Success | 1.09 | test_routers.py
test_01_router_internal_basic | Success | 0.58 | test_routers.py
test_router_dns_guestipquery | Success | 76.67 | test_router_dns.py

[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-20 Thread blueorangutan
Github user blueorangutan commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been 
kicked to run smoke tests




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-20 Thread rhtyd
Github user rhtyd commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@blueorangutan test




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-20 Thread blueorangutan
Github user blueorangutan commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-164




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-19 Thread blueorangutan
Github user blueorangutan commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you 
posted as I make progress.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-19 Thread rhtyd
Github user rhtyd commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@blueorangutan package




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-16 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
Due to the previous discussion, I am -1 on merging this PR.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-15 Thread cloudmonger
Github user cloudmonger commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
### ACS CI BVT Run
 **Summary:**
 Build Number 135
 Hypervisor xenserver
 NetworkType Advanced
 Passed=102
 Failed=3
 Skipped=6

_Link to logs Folder (search by build_no):_ 
https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0


**Failed tests:**
* test_non_contigiousvlan.py

 * test_extendPhysicalNetworkVlan Failed

* test_deploy_vm_iso.py

 * test_deploy_vm_from_iso Failing since 19 runs

* test_vm_life_cycle.py

 * test_10_attachAndDetach_iso Failing since 20 runs


**Skipped tests:**
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

**Passed test suites:**
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_snapshots.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_login.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_routers_network_ops.py
test_disk_offerings.py




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 with custom plugins, there is no way to reliably perform such 
tracing.  I can think of batch cleanup operations in the storage layer that 
follow the pattern I described.  Even if there were, we would have planted a 
landmine for future changes to the system.  Deadlocks are significant technical 
debt that is clearly causing significant operational issues.  Unfortunately, 
there is no way to address them generically.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@jburwell I concur, but if @yvsubhash verified that those methods don't 
participate in complex DML transactions, this might still be a good start. If 
so, this approach might be expanded later to multi-DML transactions so that each 
piece can be retried individually. I myself traced a few deadlocks in ACS using 
native MySQL deadlock logging, and there doesn't seem to be a viable 
alternative to retries due to the well-known complexity of ACS DB operations.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 there remains a risk when those methods are executed in the context 
of an open transaction where DMLs have already been executed and subsequent 
DMLs will be executed.  In this scenario, the first set of the changes would be 
lost due to the rollback triggered by the query deadlock with the second set 
proceeding successfully.
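
The scenario above can be reproduced with a toy in-memory model. This is a sketch for illustration only: `ToySession` is a hypothetical stand-in for a database session (it is not an ACS or JDBC class), and the deadlock is simulated by clearing the uncommitted work and throwing.

```java
import java.util.ArrayList;
import java.util.List;

public class PartialRetryDemo {
    /** Toy session: statements are buffered until commit; a deadlock discards the buffer. */
    static class ToySession {
        final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();
        private boolean deadlockArmed = true;

        void execute(String dml) {
            if (deadlockArmed && dml.equals("UPDATE B")) {
                deadlockArmed = false;
                pending.clear(); // the server rolls back the whole transaction
                throw new IllegalStateException("deadlock victim");
            }
            pending.add(dml);
        }

        void commit() {
            committed.addAll(pending);
            pending.clear();
        }
    }

    public static List<String> run(boolean retryWholeTransaction) {
        ToySession session = new ToySession();
        String[] statements = {"UPDATE A", "UPDATE B", "UPDATE C"};
        if (retryWholeTransaction) {
            while (true) {
                try {
                    for (String dml : statements) {
                        session.execute(dml);
                    }
                    break;
                } catch (IllegalStateException deadlock) {
                    // start again from the first statement
                }
            }
        } else {
            for (String dml : statements) {
                try {
                    session.execute(dml);
                } catch (IllegalStateException deadlock) {
                    session.execute(dml); // retry only the statement that failed
                }
            }
        }
        session.commit();
        return session.committed;
    }
}
```

With statement-level retry, `run(false)` commits only `UPDATE B` and `UPDATE C`; the earlier `UPDATE A` is lost without any error. `run(true)` commits all three.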




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@jburwell @yvsubhash I might be wrong, but this PR will retry on deadlock 
for only 2 DAO methods: searchIncludingRemoved and 
customSearchIncludingRemoved. No update methods are set with this retry 
mechanism. If that's the case, there is no risk of corrupting the DB.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 that is not a safe assumption.  Transactions often span multiple 
statements and methods across DAOs.  `TransactionLegacy` has a transaction 
stacking/nested model that further obscures when a transaction actually 
completes.

Deadlocks are a severe problem that needs to be fixed.  Unfortunately, this 
patch would do more harm than good as it would eventually corrupt the database. 
In and of themselves, retries are also a very expensive solution to the 
problem both in terms of the engineering effort required to do it properly and 
the extra stress placed on the database to perform additional work that will 
likely fail.  Furthermore, a generic **and** correct retry mechanism is a very 
difficult thing to write.  Given the way transaction boundaries are managed in 
ACS, I think such an effort would be nearly impossible.

In a properly written application, deadlocks should very rarely, if ever, 
occur.  Their presence is a symptom of improper transaction handling and/or 
poor lock management.   Therefore, my suggestion is that we change 
this patch to log details about the context in which deadlocks occur.  We can 
then use this information to identify the areas in ACS where these contention 
problems are located and fix the root cause.
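
As a rough sketch of the logging suggested here (not code from this PR), the helper below recognizes a deadlock and records the context without retrying anything. The class name, the `txName` and `sql` parameters, and the use of `java.util.logging` are assumptions made to keep the example self-contained; 1213 and SQLSTATE 40001 are the values MySQL reports for a deadlock victim.

```java
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class DeadlockContextLogger {
    private static final Logger LOG = Logger.getLogger(DeadlockContextLogger.class.getName());

    // MySQL ER_LOCK_DEADLOCK
    static final int MYSQL_DEADLOCK_ERROR_CODE = 1213;

    public static boolean isDeadlock(SQLException e) {
        return e.getErrorCode() == MYSQL_DEADLOCK_ERROR_CODE || "40001".equals(e.getSQLState());
    }

    /** Builds one line of context from whatever the caller knows about the work in flight. */
    public static String describe(SQLException e, String txName, String sql) {
        return String.format("deadlock in tx=%s thread=%s sqlState=%s code=%d sql=[%s]",
                txName, Thread.currentThread().getName(), e.getSQLState(), e.getErrorCode(), sql);
    }

    /** Logs context and stack trace for deadlocks only; the caller still rethrows. */
    public static boolean logIfDeadlock(SQLException e, String txName, String sql) {
        if (!isDeadlock(e)) {
            return false;
        }
        LOG.log(Level.WARNING, describe(e, txName, sql), e);
        return true;
    }
}
```

A caller would invoke `logIfDeadlock` in its `catch (SQLException e)` block and then rethrow, so the failure stays visible while the log accumulates the call sites that contend for locks.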




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@jburwell I thought that most, if not all, ACS interactions through DAOs are 
atomic transactions. Do we have cases of multiple DML statements as 
part of the same transaction? We have been seeing quite a few deadlocks in 
high-transaction-volume environments where multiple management servers are 
employed. This causes quite a pain for users due to the randomness and the lack 
of a good recourse or explanation. I would argue that a proper retry is the 
better choice, provided we cover all cases, including complex transactions. We 
have been successful leveraging this approach in systems built on top of ACS.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@serg38 my reading of the code is that only the most recently attempted DML 
will be re-executed.  Furthermore, retrying without refreshing the base data 
can also lead to data corruption.  The best thing to do in the case of a 
deadlock is to fail and roll back due to the risk of data corruption.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@jburwell @yvsubhash My understanding is that all rolled-back statements will 
receive MYSQL_DEADLOCK_ERROR_CODE and will be retried as part of this patch.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-14 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
@yvsubhash according to the [MySQL deadlock 
documentation](http://dev.mysql.com/doc/refman/5.7/en/innodb-deadlocks.html), a 
`MYSQL_DEADLOCK_ERROR_CODE` error indicates the enclosing transaction has been 
rolled back.  The proper handling for this error is to re-execute all 
statements executed in the aborted transaction.  From a best practices 
perspective, all base data should be re-retrieved and changed to ensure logical 
consistency with changes made by the transaction that won deadlock resolution.

As I understand this patch, only the most recently executed DML is retried. 
 Therefore, any previously executed changes will be discarded and the DML will 
be re-executed either in a new transaction or in auto-commit (I didn't look up 
how the client handles the transaction context in this scenario).  If my 
understanding is correct, this patch could lead to issues ranging from 
unexpected foreign key integrity errors to data corruption.

Rather than attempting to implement a generic retry, I think the best approach 
to addressing deadlocks is to treat them as bugs.  This patch could be modified 
to provide detailed logging information about the conditions under which a 
deadlock occurs, providing the information necessary to refactor the system to 
avoid lock contention.
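
For reference, the handling described in this comment (re-execute every statement of the aborted transaction, re-reading base data each time) can be sketched as below. This is an illustration under stated assumptions, not the approach taken by this PR: `TransactionBody` and `runWithRetry` are hypothetical names, and the body is assumed to open, execute and commit its own transaction so that a retry starts from scratch.

```java
import java.sql.SQLException;

public class DeadlockRetry {
    // MySQL reports a deadlock victim as error 1213 (ER_LOCK_DEADLOCK), SQLSTATE 40001.
    static final int MYSQL_DEADLOCK_ERROR_CODE = 1213;

    public static boolean isDeadlock(SQLException e) {
        return e.getErrorCode() == MYSQL_DEADLOCK_ERROR_CODE || "40001".equals(e.getSQLState());
    }

    /** Hypothetical unit of work: begin, re-read base data, run all DML, commit. */
    public interface TransactionBody<T> {
        T run() throws SQLException;
    }

    public static <T> T runWithRetry(TransactionBody<T> body, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // The whole body runs again, so no earlier statement is silently dropped.
                return body.run();
            } catch (SQLException e) {
                if (!isDeadlock(e)) {
                    throw e; // other errors are not retried
                }
                last = e;
            }
        }
        throw last; // give up after maxAttempts deadlocks
    }
}
```

The retry boundary is the outermost transaction, which is exactly what is hard to locate when transactions are nested or stacked across DAOs.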




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

2016-11-11 Thread serg38
Github user serg38 commented on the issue:

https://github.com/apache/cloudstack/pull/1762
  
LGTM. Finally !!! We have been seeing occasional deadlocks in environments 
with high transaction rates. @rhtyd @jburwell This could be a good addition to 
4.8/4.9. 

