Re: [openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug
Excerpts from ELISHA, Moshe (Moshe)'s message of 2015-12-07 08:29:44 -0800: > Hi all, > > The current bugfix I am working on[1] have unveiled / created a bug. > Test "WorkflowResumeTest.test_resume_different_task_states" sometimes fails > because "task4" is executed twice instead of once (See unit test output and > workflow below). > > This happens because task2 on-complete is running task4 as expected but also > task3 executes task4 by mistake. > > It is not consistent but it happens quite often - This happens if the unit > test resumes the WF and updates action execution of task2 and finishes task2 > before task3 is finished. > Scenario: > > > 1. Task2 in method on_action_complete - changes task2 state to RUNNING. > > 2. Task3 in method on_action_complete - changes task2 state to RUNNING > (before task2 calls _on_task_state_change). > > 3. Task3 in "_on_task_state_change" > "continue_workflow" > > "DirectWorkflowController ._find_next_commands" - it finds task2 because > task2 is in SUCCESS and processed = False and > "_find_next_commands_for_task(task2)" returns task4. > > 4. Task3 executes command to RunTask task4. > > 5. Task2 in "_on_task_state_change" > "continue_workflow" > > "DirectWorkflowController ._find_next_commands" - it finds task2 because > task2 is in SUCCESS and processed = False and > "_find_next_commands_for_task(task2)" returns task4. > > 6. Task2 executes command to RunTask task4. > > > [1] - https://review.openstack.org/#/c/253819/ > > > If I am not mistaken - this issue also exists in the current code and my > bugfix only made it much more often. Can you confirm? > I don't have enough knowledge on how to fix this issue... > For now - I have modified the test_resume_different_task_states unit test to > wait for task3 to be processed before updating the action execution of task2. > If you agree this bug exist today as well - we can proceed with my bugfix and > open a different bug for that issue. > I'd agree that this is likely happening more reliably because READ COMMITTED will just give you the state that causes the bug more often than REPEATABLE READ, because now if you happen to have threads running at the same time when the new vulnerable state is reached, they can both see the new state and react to it. Before they only had that problem if they both started after the enabling state was in the db, thus sharing the same db snapshot. What you actually need is to atomically claim the rows. You have to do like this: UPDATE whatever_table SET executor = 'me' WHERE executor IS NULL; And if you get 0 rows updated, that means somebody else claimed it and you do nothing. Note that you also need some liveness testing in this system, since if your executor dies, that row will be lost forever. In Heat they solved it by having a queue for each executor and pinging on oslo.messaging. Please don't do that. I suggest instead switching to tooz, and joining the distributed lock revolution, where zookeeper will give you a nice atomic distributed lock for this, and detect when to break it because of a dead executor. (or consul or etcd once our fine community finishes landing those) __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug
Hi, I have reproduced this issue multiple times before your fix was merged. In order to reproduce I used a workflow with multiple async actions, and resumed all of them at the same time. I just created a ticket in launchpad [1], with the workflow used and the mistral engine logs. [1] - https://bugs.launchpad.net/mistral/+bug/1524477 If anyone could take a look and confirm the bug it would be great. Thanks Noa Koffman From: ELISHA, Moshe (Moshe) [moshe.eli...@alcatel-lucent.com] Sent: Monday, December 07, 2015 6:29 PM To: OpenStack Development Mailing List (not for usage questions) Subject: [openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug Hi all, The current bugfix I am working on[1] have unveiled / created a bug. Test “WorkflowResumeTest.test_resume_different_task_states” sometimes fails because “task4” is executed twice instead of once (See unit test output and workflow below). This happens because task2 on-complete is running task4 as expected but also task3 executes task4 by mistake. It is not consistent but it happens quite often – This happens if the unit test resumes the WF and updates action execution of task2 and finishes task2 before task3 is finished. Scenario: 1. Task2 in method on_action_complete – changes task2 state to RUNNING. 2. Task3 in method on_action_complete – changes task2 state to RUNNING (before task2 calls _on_task_state_change). 3. Task3 in “_on_task_state_change” > “continue_workflow” > “DirectWorkflowController ._find_next_commands” – it finds task2 because task2 is in SUCCESS and processed = False and “_find_next_commands_for_task(task2)” returns task4. 4. Task3 executes command to RunTask task4. 5. Task2 in “_on_task_state_change” > “continue_workflow” > “DirectWorkflowController ._find_next_commands” – it finds task2 because task2 is in SUCCESS and processed = False and “_find_next_commands_for_task(task2)” returns task4. 6. Task2 executes command to RunTask task4. [1] - https://review.openstack.org/#/c/253819/ If I am not mistaken – this issue also exists in the current code and my bugfix only made it much more often. Can you confirm? I don’t have enough knowledge on how to fix this issue… For now – I have modified the test_resume_different_task_states unit test to wait for task3 to be processed before updating the action execution of task2. If you agree this bug exist today as well – we can proceed with my bugfix and open a different bug for that issue. Thanks. [stack@melisha-devstack mistral(keystone_admin)]$ tox -e py27 -- WorkflowResumeTest.test_resume_different_task_states ... == FAIL: mistral.tests.unit.engine.test_workflow_resume.WorkflowResumeTest.test_resume_different_task_states tags: worker-0 -- pythonlogging:'': {{{WARNING [oslo_db.sqlalchemy.utils] Id not in sort_keys; is sort_keys unique?}}} stderr: {{{ /opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/novaclient/v2/client.py:109: UserWarning: 'novaclient.v2.client.Client' is not designed to be initialized directly. It is inner class of novaclient. Please, use 'novaclient.client.Client' instead. Related lp bug-report: 1493576 _LW("'novaclient.v2.client.Client' is not designed to be " }}} stdout: {{{ Engine test case exception occurred: 4 != 5 Exception type: Printing workflow executions... wb.wf1 [state=SUCCESS, output={u'__execution': {u'params': {}, u'id': u'2807dd99-ca6f-49d7-886d-7d3b79e1c49e', u'spec': {u'type': u'direct', u'name': u'wf1', u'tasks': {u'task4': {u'type': u'direct', u'name': u'task4', u'version': u'2.0', u'action': u'std.echo output="Task 4"'}, u'task2': {u'type': u'direct', u'name': u'task2', u'on-complete': [u'task4'], u'version': u'2.0', u'action': u'std.mistral_http url="http://google.com";'}, u'task3': {u'type': u'direct', u'name': u'task3', u'version': u'2.0', u'action': u'std.echo output="Task 3"'}, u'task1': {u'type': u'direct', u'name': u'task1', u'on-complete': [u'task3', u'pause'], u'version': u'2.0', u'action': u'std.echo output="Hi!"'}}, u'version': u'2.0'}, u'input': {}
[openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug
Hi all, The current bugfix I am working on[1] have unveiled / created a bug. Test "WorkflowResumeTest.test_resume_different_task_states" sometimes fails because "task4" is executed twice instead of once (See unit test output and workflow below). This happens because task2 on-complete is running task4 as expected but also task3 executes task4 by mistake. It is not consistent but it happens quite often - This happens if the unit test resumes the WF and updates action execution of task2 and finishes task2 before task3 is finished. Scenario: 1. Task2 in method on_action_complete - changes task2 state to RUNNING. 2. Task3 in method on_action_complete - changes task2 state to RUNNING (before task2 calls _on_task_state_change). 3. Task3 in "_on_task_state_change" > "continue_workflow" > "DirectWorkflowController ._find_next_commands" - it finds task2 because task2 is in SUCCESS and processed = False and "_find_next_commands_for_task(task2)" returns task4. 4. Task3 executes command to RunTask task4. 5. Task2 in "_on_task_state_change" > "continue_workflow" > "DirectWorkflowController ._find_next_commands" - it finds task2 because task2 is in SUCCESS and processed = False and "_find_next_commands_for_task(task2)" returns task4. 6. Task2 executes command to RunTask task4. [1] - https://review.openstack.org/#/c/253819/ If I am not mistaken - this issue also exists in the current code and my bugfix only made it much more often. Can you confirm? I don't have enough knowledge on how to fix this issue... For now - I have modified the test_resume_different_task_states unit test to wait for task3 to be processed before updating the action execution of task2. If you agree this bug exist today as well - we can proceed with my bugfix and open a different bug for that issue. Thanks. [stack@melisha-devstack mistral(keystone_admin)]$ tox -e py27 -- WorkflowResumeTest.test_resume_different_task_states ... == FAIL: mistral.tests.unit.engine.test_workflow_resume.WorkflowResumeTest.test_resume_different_task_states tags: worker-0 -- pythonlogging:'': {{{WARNING [oslo_db.sqlalchemy.utils] Id not in sort_keys; is sort_keys unique?}}} stderr: {{{ /opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/novaclient/v2/client.py:109: UserWarning: 'novaclient.v2.client.Client' is not designed to be initialized directly. It is inner class of novaclient. Please, use 'novaclient.client.Client' instead. Related lp bug-report: 1493576 _LW("'novaclient.v2.client.Client' is not designed to be " }}} stdout: {{{ Engine test case exception occurred: 4 != 5 Exception type: Printing workflow executions... wb.wf1 [state=SUCCESS, output={u'__execution': {u'params': {}, u'id': u'2807dd99-ca6f-49d7-886d-7d3b79e1c49e', u'spec': {u'type': u'direct', u'name': u'wf1', u'tasks': {u'task4': {u'type': u'direct', u'name': u'task4', u'version': u'2.0', u'action': u'std.echo output="Task 4"'}, u'task2': {u'type': u'direct', u'name': u'task2', u'on-complete': [u'task4'], u'version': u'2.0', u'action': u'std.mistral_http url="http://google.com";'}, u'task3': {u'type': u'direct', u'name': u'task3', u'version': u'2.0', u'action': u'std.echo output="Task 3"'}, u'task1': {u'type': u'direct', u'name': u'task1', u'on-complete': [u'task3', u'pause'], u'version': u'2.0', u'action': u'std.echo output="Hi!"'}}, u'version': u'2.0'}, u'input': {}}, u'task4': u'Task 4', u'task3': u'Task 3', u'__tasks': {u'848c6e92-b1b1-4d54-b11d-c93cfb4fc88f': u'task2', u'00a546e7-8da9-4603-b6be-54d58b14c625': u'task1'}}] task2 [id=848c6e92-b1b1-4d54-b11d-c93cfb4fc88f, state=SUCCESS, published={}] task1 [id=00a546e7-8da9-4603-b6be-54d58b14c625, state=SUCCESS, published={}] task3 [id=8ce20324-7fba-4424-bcd2-1e0c9b27fd4a, state=SUCCESS, published={}] task4 [id=3758de43-9bc3-4ac9-b3f3-29eb543b16ef, state=SUCCESS, published={}] task4 [id=f12ee464-0ba5-48c7-8423-9f425a00e675, state=SUCCESS, published={}] }}} Traceback (most recent call last): File "mistral/tests/unit/engine/test_workflow_resume.py", line 321, in test_resume_different_task_states self.assertEqual(4, len(wf_ex.task_executions)) File "/opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 350, in assertEqual self.assertThat(observed, matcher, message) File "/opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 435, in assertThat raise mismatch_error testtools.matchers._impl.MismatchError: 4 != 5 Ran 1 tests in 15.517s (+2.591s) FAILED (id=301, failures=1 (+1)) error: testr failed (1) ERROR: InvocationError: '/opt/stack/mistral/.tox/py27/bin/python setup.py testr --slowest --testr-args=WorkflowResumeTest.test_resume_different_task_states' ___