Thanks for the review, Alex. I pushed an additional commit to add commentary and to explicitly pass a copy option that indicates we want to preserve attributes like timestamps.
On Wed, Jan 23, 2019 at 12:23 PM Alex Amato <[email protected]> wrote: > Thanks Jeff, I reviewed your PR with one suggestion to add a comment to > make the test more clear. I am assuming the modified times get copied, not > re-timestamped on copy, which is why your method works. > Otherwise looks good to me > > On Wed, Jan 23, 2019 at 5:49 AM Jeff Klukas <[email protected]> wrote: > >> Posted https://github.com/apache/beam/pull/7599 >> >> That PR follows suggestion #4. I chose that route because it maintains >> the PAssert containsInAnyOrder check which seems easier to read and more >> straight-forward than PAssert satisfies. >> >> Do let me know if you disagree and I can switch back to Eugene's >> suggestion #1. >> >> On Wed, Jan 23, 2019 at 8:12 AM Jeff Klukas <[email protected]> wrote: >> >>> Suggestion #4: Create source files outside the writer thread, and then >>> copy them from a source directory to the watched directory. That should >>> atomically write the file with the already known lastModificationTime. >>> >>> On Wed, Jan 23, 2019 at 7:37 AM Jeff Klukas <[email protected]> wrote: >>> >>>> I'll work on getting a PR together this morning, probably following >>>> Eugene's suggestion #1. >>>> >>>> On Tue, Jan 22, 2019 at 8:34 PM Udi Meiri <[email protected]> wrote: >>>> >>>>> Alex, the only way to implement my suggestion #1 (that I know of) >>>>> would be to write to a file and read it back. >>>>> I don't have good example for #2. >>>>> >>>>> Eugene's suggestion no. 1 seems like a good idea. There are some >>>>> example >>>>> <https://github.com/apache/beam/blob/324a1bcc820945731ccce7dd7e5354247b841356/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIOWriteTest.java#L335-L340> >>>>> in the codebase. >>>>> >>>>> On Tue, Jan 22, 2019 at 5:16 PM Eugene Kirpichov <[email protected]> >>>>> wrote: >>>>> >>>>>> Yeah the "List<MatchResult.Metadata> expected" is constructed >>>>>> from Files.getLastModifiedTime() calls before the files are actually >>>>>> modified, the code is basically unconditionally broken rather than merely >>>>>> flaky. >>>>>> >>>>>> There's several easy options: >>>>>> 1) Use PAssert.that().satisfies() instead of .contains(), and use >>>>>> assertThat().contains() inside that, with the list constructed at time >>>>>> the >>>>>> assertion is applied rather than declared. >>>>>> 2) Implement a Matcher<Metadata> that ignores last modified time and >>>>>> use that >>>>>> >>>>>> Jeff - your option #3 is unfortunately also race-prone, because the >>>>>> code may match the files after they have been written but before >>>>>> setLastModifiedTime was called. >>>>>> >>>>>> On Tue, Jan 22, 2019 at 5:08 PM Jeff Klukas <[email protected]> wrote: >>>>>> >>>>>>> Another option: >>>>>>> >>>>>>> #3 Have the writer thread call Files.setLastModifiedTime explicitly >>>>>>> after each File.write. Then the lastModifiedMillis can be a stable value >>>>>>> for each file and we can use those same static values in our expected >>>>>>> result. I think that would also eliminate the race condition. >>>>>>> >>>>>>> On Tue, Jan 22, 2019 at 7:48 PM Alex Amato <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Thanks Udi, is there a good example for either of these? >>>>>>>> #1 - seems like you have to rewrite your assertion logic without >>>>>>>> the PAssert? Is there some way to capture the pipeline output and >>>>>>>> iterate >>>>>>>> over it? The pattern I have seen for this in the past also has thread >>>>>>>> safety issues (Using a DoFn at the end of the pipeline to add the >>>>>>>> output to >>>>>>>> a collection is not safe since the collection can be executed >>>>>>>> concurrently) >>>>>>>> #2 - Would BigqueryMatcher be a good example for this? which is >>>>>>>> used in BigQueryTornadoesIT.java Or is there another example you would >>>>>>>> suggest looking at for reference? >>>>>>>> >>>>>>>> - I guess to this you need to implement the SerializableMatcher >>>>>>>> interface and use the matcher as an option in the pipeline options. >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jan 22, 2019 at 4:28 PM Udi Meiri <[email protected]> wrote: >>>>>>>> >>>>>>>>> Some options: >>>>>>>>> - You could wait to assert until after p.waitForFinish(). >>>>>>>>> - You could PAssert using SerializableMatcher and allow any >>>>>>>>> lastModifiedTime. >>>>>>>>> >>>>>>>>> On Tue, Jan 22, 2019 at 3:56 PM Alex Amato <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> +Jeff, Eugene, >>>>>>>>>> >>>>>>>>>> Hi Jeff and Eugene, >>>>>>>>>> >>>>>>>>>> I've noticed that Jeff's PR >>>>>>>>>> <https://github.com/apache/beam/commit/410d6c7b5f933dcb0280894553c1e576ee4e4884> >>>>>>>>>> introduced >>>>>>>>>> a race condition in this test, but its not clear exactly how to add >>>>>>>>>> Jeff's >>>>>>>>>> test check in a thread safe way. I believe this to be the source of >>>>>>>>>> the >>>>>>>>>> flakeyness Do you have any suggestions Eugene (since you authored >>>>>>>>>> this >>>>>>>>>> test)? >>>>>>>>>> >>>>>>>>>> I added some details to this JIRA issue explaining in full >>>>>>>>>> https://jira.apache.org/jira/browse/BEAM-6491?filter=-2 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Jan 22, 2019 at 3:34 PM Alex Amato <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I've seen this fail in a few different PRs for different >>>>>>>>>>> contributors, and its causing some issues during the presubmit >>>>>>>>>>> process.. >>>>>>>>>>> This is a multithreadred test with a lot of sleeps, so it looks a >>>>>>>>>>> bit >>>>>>>>>>> suspicious as the source of the problem. >>>>>>>>>>> >>>>>>>>>>> https://builds.apache.org/job/beam_PreCommit_Java_Commit/3688/testReport/org.apache.beam.sdk.io/FileIOTest/testMatchWatchForNewFiles/ >>>>>>>>>>> >>>>>>>>>>> I filed a JIRA for this issue: >>>>>>>>>>> https://jira.apache.org/jira/browse/BEAM-6491?filter=-2 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>
