albertogpz commented on a change in pull request #7330:
URL: https://github.com/apache/geode/pull/7330#discussion_r812120574
##########
File path:
geode-wan/src/distributedTest/java/org/apache/geode/cache/wan/internal/cli/commands/WanCopyRegionCommandDUnitTest.java
##########
@@ -1460,11 +1460,15 @@ public void createServersAndRegions(int locatorPort,
List<VM> servers,
}
private void waitForWanCopyRegionCommandToStart(boolean useParallel, boolean
usePartitionedRegion,
- List<VM> servers) {
+ List<VM> servers) throws InterruptedException {
+ // Wait for the command execution to be registered in the service
final int executionsExpected = useParallel && usePartitionedRegion ?
servers.size() : 1;
await().untilAsserted(
() -> assertThat(getNumberOfCurrentExecutionsInServers(servers))
.isEqualTo(executionsExpected));
+ // Wait some extra milliseconds to allow for the command to actually start
to
+ // avoid flakyness in the tests.
+ Thread.sleep(100);
Review comment:
> @agingade, let me try to explain why this is not an bug in the command
but a problem with timing in the test case.
>
> At the beginning, the test puts 50000 entries whose keys are in the range
of 0-49999.
>
> Then, the wan-copy region command is executed and, in parallel, other puts
and deletes are done. The keys for these entries puts/deleted are in the range
of 50000-52000 (no intersection with the first entries put).
>
> The test tries to verify that only the entries put at the beginning are
copied (none of the entries put while the command is running). It does so by
verifying that the number of entries copied is exactly 50000 as reported by the
command. If any entry put while the command is running is copied by the
command, we should see that the number of entries copied is greater than 50000,
which is what we see when the test fails (50001). But the test needs to make
sure that the puts for the 50000-52000 range are done after the copy command
has started and not before. Otherwise, we could get a failure of the test
(which is what we see sometimes).
>
> The way to make sure that the second group of puts starts after the
command has started is by means of the `await()` you mention. But when the
`await()` returns, it only guarantees that the `Callable` that will execute the
command has been submitted to the pool (see
`WanCopyRegionFunctionService.execute()`) although the command may not have
effectively started.
>
> At that point, random operations (puts and deletes) can be started. Let's
say at t=x, entry with key=50000 is put.
>
> The time at which the Callable object that in turn calls the
`WanCopyRegionFunctionDelegate.wanCopyRegion()`is executed depends on the
scheduling. Let's say it executes its first line at t=x + 1. At that point, the
current time is recorded as (`functionStartTimestamp`). and a sleep of 500ms
will be done before entries are started to be copied.
>
> Then the command starts to copy entries. Any entry with a timestamp
greater than `functionStartTimestamp` will not be copied. In this case, the
timestamp for entry with key=50000 is x and given that `functionStartTimestamp`
is x+1, the entry will be copied. This would not point to an bug in the
command. Rather, that the entry with key=50000 was put before the command was
started.
>
> With the 100ms extra added to the wait we can be more certain that second
group of puts will be started after the `functionStartTimestamp` is recorded
and therefore, the timestamp for the entries of the second group of puts will
be higher than the `functionStartTimestamp`.
@agingade I have removed the sleep and changed the check to avoid the
flakyness by looking at the region size at the remote site.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]