[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs
[ https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065342#comment-17065342 ] Wilfred Spiegelenburg commented on YUNIKORN-30: --- PR is open. The current fixed code has passed 5 out of 6 times in the github action. The one single failure was outside of smoke tests so it looks like I have got it now. > flaky tests cause build failures on PRs > --- > > Key: YUNIKORN-30 > URL: https://issues.apache.org/jira/browse/YUNIKORN-30 > Project: Apache YuniKorn > Issue Type: Test > Components: test - smoke >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Blocker > Labels: pull-request-available > Attachments: TestBasicScheduler_github_fail.log > > Time Spent: 10m > Remaining Estimate: 0h > > Smoke tests have been failing on PR triggered builds. > Failures are inconsistent and linked to multiple test cases, failures in the > same tests can even happen in different lines of code in different runs > without changes: > {code} > 2020-03-11T04:39:40.8332236Z --- FAIL: TestSchedulerRecovery (3.07s) > 2020-03-11T04:39:40.8340886Z ##[error]mock_rm_callback.go:175: Failed to > wait for allocations, expected 4, actual 3, called from: > TestSchedulerRecovery in scheduler_recovery_test.go:213 > {code} > {code} > 2020-03-11T04:39:40.9102758Z --- FAIL: TestBasicScheduler (1.11s) > 2020-03-11T04:39:40.9103549Z ##[error]mock_rm_callback.go:175: Failed to > wait for allocations, expected 4, actual 3, called from: TestBasicScheduler > in scheduler_smoke_test.go:341 > {code} > {code} > 2020-03-06T07:17:50.4567697Z --- FAIL: TestReservationForTwoQueues (3.10s) > 2020-03-06T07:17:50.4574239Z ##[error]scheduler_reservation_test.go:276: > partition reservations are missing > {code} > {code} > 2020-03-06T08:08:21.8912443Z --- FAIL: TestRemoveReservedNode (1.05s) > 2020-03-06T08:08:21.8917559Z ##[error]scheduler_utils.go:79: Failed to > wait for pending resource, expected 80, actual 60, 
called from: > TestRemoveReservedNode in scheduler_reservation_test.go:356 > {code} > {code} > 2020-03-04T10:42:16.5788872Z --- FAIL: TestRemoveReservedNode (0.07s) > 2020-03-04T10:42:16.5789359Z ##[error]scheduler_reservation_test.go:357: > assertion failed: 2 (int) != 1 (int): reservations missing from app > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs
[ https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065092#comment-17065092 ]

Weiwei Yang commented on YUNIKORN-30:
---
Hi [~wilfreds], this is great. So this issue only happens when we run all the tests together, and some of them pick up interference from previous runs, correct? Do you have a fix for this now?
[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs
[ https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064546#comment-17064546 ]

Wilfred Spiegelenburg commented on YUNIKORN-30:
---
I have found the issue and finally have proof of it:
{code}
2020-03-23T14:52:59.893+1100 DEBUG scheduler/scheduling_application.go:529 app reservation check {"allocationKey": "alloc-3", "createTime": "2020-03-23T14:52:59.880+1100", "askAge": "13.39217ms", "reservation delay": "10ms"}
{code}
This was logged in a failure case for {{TestBasicScheduler}}. This test does not set the reservation delay, so the delay should still be at the standard 2s. It seems to have picked up the setting from a previous run. I have attached a full log to show the whole run.

If the test had set the delay, we would have seen a line like this in the log:
{code}
2020-03-23T14:52:57.772+1100 DEBUG scheduler/scheduling_application.go:65 Test override reservation delay {"delay": "10ms"}
{code}
That line is nowhere in the logs.
[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs
[ https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064503#comment-17064503 ]

Wilfred Spiegelenburg commented on YUNIKORN-30:
---
I am not sure whether this is just a logging artefact or whether it shows the goroutine scheduling at work, which might be part of the impact we see in the GitHub workflows. The shutdown is logged before we process the last events:
{code}
2020-03-20T16:37:20.061+1100 DEBUG cache/partition_info.go:581 added allocation {"partitionName": "[rm:123]default", "appID": "app-3", "allocationUid": "31467f7e-b7b6-4f39-b823-cc1a6536a766", "allocKey": "alloc-3"}
2020-03-20T16:37:20.061+1100 DEBUG cache/partition_info.go:581 added allocation {"partitionName": "[rm:123]default", "appID": "app-3", "allocationUid": "31467f7e-b7b6-4f39-b823-cc1a6536a766", "allocKey": "alloc-3"}
2020-03-20T16:37:20.061+1100 INFO entrypoint/service_context.go:38 ServiceContext stop all services
2020-03-20T16:37:20.061+1100 DEBUG scheduler/scheduler.go:176 enqueued event {"eventType": "*schedulerevent.SchedulerAllocationUpdatesEvent", "event": {"RejectedAllocations":null,"AcceptedAllocations":[{"NodeID":"node-2","ApplicationID":"app-3","QueueName":"root.leaf-2","AllocatedResource":{"Resources":{"memory":5,"vcore":5}},"AllocationKey":"alloc-3","Tags":null,"Priority":null,"PartitionName":"[rm:123]default"}],"NewAsks":null,"ToReleases":null,"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
2020-03-20T16:37:20.061+1100 DEBUG scheduler/scheduling_partition.go:434 allocation confirmation on partition {"partition": "[rm:123]default", "appID": "app-3", "nodeID": "node-2", "allocKey": "alloc-3", "confirmation": true}
--- PASS: TestReservationForTwoQueues (0.11s)
{code}
[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs
[ https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063052#comment-17063052 ]

Wilfred Spiegelenburg commented on YUNIKORN-30:
---
Further details: I have what I think is the root cause behind some of the failures. I described one of the cases I found above, but some others show a different failure type.
* Incorrect partition name: we seem to call normalise on the partition name twice during our testing.
{code}
2020-03-19T21:31:53.9142459Z 2020-03-19T21:31:41.042Z INFO cache/cluster_info.go:618 Failed to find partition for allocation proposal {"partitionName": "[rm:123][rm:123]default"}
{code}
This should never be a problem: it shows a bug in the code, and we should be able to handle it. The fix is for the normalisation code to check whether the name is already normalised.
* Event handling: a generic underlying issue. During some local testing I noticed that we do not properly wait for the event handling to process all the events that are generated. In the case observed, allocation releases were still being processed while the end state check was performed. Those issues can be fixed by a proper wait in the test code.
* However, in certain failures we see nothing. This could point to a problem with goroutines not being scheduled. The logs for these cases show a blank period of about 1 sec (the maximum time we wait for things) between the normal processing and the wait timing out. I cannot really reproduce those yet.

I am working on a PR to fix at least the majority of what I have found.
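The partition name fix described above amounts to making the normalisation idempotent, so that a second call leaves an already prefixed name alone. A minimal sketch, assuming the normalised form is simply the RM ID in square brackets followed by the partition name, as in the {{[rm:123]default}} log entries; the function name and signature are hypothetical and not taken from the YuniKorn code:

```go
package main

import (
	"fmt"
	"strings"
)

// normalisePartitionName prefixes name with "[rmID]" unless the prefix is
// already present, so calling it twice gives the same result as calling it once.
// Hypothetical function, for illustration only.
func normalisePartitionName(name, rmID string) string {
	prefix := "[" + rmID + "]"
	if strings.HasPrefix(name, prefix) {
		return name // already normalised, do not add the prefix again
	}
	return prefix + name
}

func main() {
	once := normalisePartitionName("default", "rm:123")
	twice := normalisePartitionName(once, "rm:123")
	fmt.Println(once)
	fmt.Println(twice)
}
```

With this check in place the double call seen during testing yields {{[rm:123]default}} both times instead of {{[rm:123][rm:123]default}}.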