[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs

2020-03-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065342#comment-17065342
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-30:
---

A PR is open.
The current fix has passed 5 out of 6 runs in the GitHub Actions workflow. The 
single failure was outside the smoke tests, so it looks like I have got it 
now.

> flaky tests cause build failures on PRs
> ---
>
> Key: YUNIKORN-30
> URL: https://issues.apache.org/jira/browse/YUNIKORN-30
> Project: Apache YuniKorn
> Issue Type: Test
> Components: test - smoke
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Blocker
> Labels: pull-request-available
> Attachments: TestBasicScheduler_github_fail.log
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Smoke tests have been failing on PR-triggered builds.
> Failures are inconsistent and spread over multiple test cases. Failures in 
> the same test can even occur at different lines of code in different runs 
> without any code changes:
> {code}
> 2020-03-11T04:39:40.8332236Z --- FAIL: TestSchedulerRecovery (3.07s)
> 2020-03-11T04:39:40.8340886Z ##[error]mock_rm_callback.go:175: Failed to wait for allocations, expected 4, actual 3, called from: TestSchedulerRecovery in scheduler_recovery_test.go:213
> {code}
> {code}
> 2020-03-11T04:39:40.9102758Z --- FAIL: TestBasicScheduler (1.11s)
> 2020-03-11T04:39:40.9103549Z ##[error]mock_rm_callback.go:175: Failed to wait for allocations, expected 4, actual 3, called from: TestBasicScheduler in scheduler_smoke_test.go:341
> {code}
> {code}
> 2020-03-06T07:17:50.4567697Z --- FAIL: TestReservationForTwoQueues (3.10s)
> 2020-03-06T07:17:50.4574239Z ##[error]scheduler_reservation_test.go:276: partition reservations are missing
> {code}
> {code}
> 2020-03-06T08:08:21.8912443Z --- FAIL: TestRemoveReservedNode (1.05s)
> 2020-03-06T08:08:21.8917559Z ##[error]scheduler_utils.go:79: Failed to wait for pending resource, expected 80, actual 60, called from: TestRemoveReservedNode in scheduler_reservation_test.go:356
> {code}
> {code}
> 2020-03-04T10:42:16.5788872Z --- FAIL: TestRemoveReservedNode (0.07s)
> 2020-03-04T10:42:16.5789359Z ##[error]scheduler_reservation_test.go:357: assertion failed: 2 (int) != 1 (int): reservations missing from app
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs

2020-03-23 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065092#comment-17065092
 ] 

Weiwei Yang commented on YUNIKORN-30:
-

Hi [~wilfreds]

This is great. So this issue only happens when we run all the tests 
together, and some of them pick up interference from previous runs, correct? 
Do you have a fix for this now?




[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs

2020-03-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064546#comment-17064546
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-30:
---

I have found the issue and finally have proof of it:
{code}
2020-03-23T14:52:59.893+1100 DEBUG scheduler/scheduling_application.go:529 app reservation check {"allocationKey": "alloc-3", "createTime": "2020-03-23T14:52:59.880+1100", "askAge": "13.39217ms", "reservation delay": "10ms"}
{code}
This was logged in a failure case for {{TestBasicScheduler}}.

This test does not set the reservation delay, so the delay should still be 
the standard 2s. It seems to have picked up the 10ms setting from a previous 
test run.

I have attached a full log to show the whole run. If the test had set the 
delay itself, we would have seen a line like this in the log:
{code}
2020-03-23T14:52:57.772+1100 DEBUG scheduler/scheduling_application.go:65 Test override reservation delay {"delay": "10ms"}
{code}
That line is nowhere in the logs.
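
To illustrate the kind of leak described above, here is a minimal sketch. It assumes the reservation delay lives in a package-level variable shared by all tests in the same test binary. The names reservationDelay, overrideReservationDelay, leakyTest and carefulTest are hypothetical and not taken from the YuniKorn code.

```go
package main

import (
	"fmt"
	"time"
)

// reservationDelay stands in for a package-level setting that every test in
// the same test binary shares. Hypothetical name; the 2s default mirrors the
// "standard 2s" mentioned in the comment above.
var reservationDelay = 2 * time.Second

// overrideReservationDelay sets the delay for one test and returns a function
// that puts the previous value back.
func overrideReservationDelay(d time.Duration) (restore func()) {
	previous := reservationDelay
	reservationDelay = d
	return func() { reservationDelay = previous }
}

// leakyTest changes the delay and never restores it, so any test that runs
// afterwards silently inherits the 10ms value.
func leakyTest() {
	reservationDelay = 10 * time.Millisecond
}

// carefulTest changes the delay and restores it when it returns.
func carefulTest() {
	restore := overrideReservationDelay(10 * time.Millisecond)
	defer restore()
	// ... test body that relies on the short delay ...
}

func main() {
	carefulTest()
	fmt.Println("after carefulTest:", reservationDelay) // 2s
	leakyTest()
	fmt.Println("after leakyTest:", reservationDelay) // 10ms
}
```

In a real test the restore function could be registered with t.Cleanup (available from Go 1.14) instead of defer, so the value is reset even if the test fails part-way through.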




[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs

2020-03-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064503#comment-17064503
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-30:
---

I am not sure whether this is just a logging artefact or whether it shows 
goroutine scheduling at work, which might explain some of the impact we see 
in the GitHub workflows. The shutdown is logged before we process the last 
events:
{code}
2020-03-20T16:37:20.061+1100 DEBUG cache/partition_info.go:581 added allocation {"partitionName": "[rm:123]default", "appID": "app-3", "allocationUid": "31467f7e-b7b6-4f39-b823-cc1a6536a766", "allocKey": "alloc-3"}
2020-03-20T16:37:20.061+1100 DEBUG cache/partition_info.go:581 added allocation {"partitionName": "[rm:123]default", "appID": "app-3", "allocationUid": "31467f7e-b7b6-4f39-b823-cc1a6536a766", "allocKey": "alloc-3"}
2020-03-20T16:37:20.061+1100 INFO entrypoint/service_context.go:38 ServiceContext stop all services
2020-03-20T16:37:20.061+1100 DEBUG scheduler/scheduler.go:176 enqueued event {"eventType": "*schedulerevent.SchedulerAllocationUpdatesEvent", "event": {"RejectedAllocations":null,"AcceptedAllocations":[{"NodeID":"node-2","ApplicationID":"app-3","QueueName":"root.leaf-2","AllocatedResource":{"Resources":{"memory":5,"vcore":5}},"AllocationKey":"alloc-3","Tags":null,"Priority":null,"PartitionName":"[rm:123]default"}],"NewAsks":null,"ToReleases":null,"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
2020-03-20T16:37:20.061+1100 DEBUG scheduler/scheduling_partition.go:434 allocation confirmation on partition {"partition": "[rm:123]default", "appID": "app-3", "nodeID": "node-2", "allocKey": "alloc-3", "confirmation": true}
--- PASS: TestReservationForTwoQueues (0.11s)
{code}




[jira] [Commented] (YUNIKORN-30) flaky tests cause build failures on PRs

2020-03-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063052#comment-17063052
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-30:
---

Further details:

I think I have the root cause behind some of the failures. I described one 
of the cases I found in an earlier comment, but others show a different type 
of failure.

* Incorrect partition name: we seem to call normalise on the partition name 
twice during our testing.
{code}
2020-03-19T21:31:53.9142459Z 2020-03-19T21:31:41.042Z INFO cache/cluster_info.go:618 Failed to find partition for allocation proposal {"partitionName": "[rm:123][rm:123]default"}
{code}
Normalising twice should never be a problem, so this shows a bug in the code: 
we should be able to handle it. The fix is for the normalisation code to 
check whether the name is already normalised.

* Event handling: a generic underlying issue. During some local testing I 
noticed that we do not properly wait for the event handling to process all the 
events that are generated. In the case I observed, allocation releases were 
still being processed while the end state check was performed. These issues 
can be fixed by a proper wait in the test code.

* However, in certain failures we see nothing. This could point to a problem 
with goroutines not being scheduled. The logs for these cases show a blank 
period of about 1 sec (the maximum time we wait for things) between the normal 
processing and the wait timing out. I cannot reproduce those yet.
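
For the partition name case above, a normalisation that is safe to call twice might look roughly like the sketch below. The function name normalisePartitionName and its signature are hypothetical. The sketch only assumes the "[rmID]name" format visible in the log line.

```go
package main

import (
	"fmt"
	"strings"
)

// normalisePartitionName prefixes name with "[rmID]" unless it already
// carries that prefix, so calling it twice gives the same result as calling
// it once. Hypothetical helper: the real function may look different.
func normalisePartitionName(name, rmID string) string {
	prefix := "[" + rmID + "]"
	if strings.HasPrefix(name, prefix) {
		return name // already normalised, leave untouched
	}
	return prefix + name
}

func main() {
	once := normalisePartitionName("default", "rm:123")
	twice := normalisePartitionName(once, "rm:123")
	fmt.Println(once)  // [rm:123]default
	fmt.Println(twice) // [rm:123]default
}
```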

Working on a PR to fix at least the majority of what I have found.
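
As an illustration of the "proper wait in the test code" mentioned above, a polling helper could look like the sketch below. The name waitForCondition, the counter and the timings are assumptions for the example and are not the helpers used in the actual tests.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForCondition polls cond every interval until it returns true or the
// timeout expires. Hypothetical helper, for illustration only.
func waitForCondition(cond func() bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %v", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	var released int32

	// Simulate asynchronous event handling that releases three allocations.
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(5 * time.Millisecond)
			atomic.AddInt32(&released, 1)
		}
	}()

	// Check the end state only after all releases have been processed.
	err := waitForCondition(func() bool {
		return atomic.LoadInt32(&released) == 3
	}, time.Millisecond, time.Second)
	fmt.Println("wait error:", err)
}
```

A wait like this only helps when events are processed late. It does not cover the third case, where goroutines appear not to be scheduled at all within the timeout.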
