GitHub user StephanEwen opened a pull request:
https://github.com/apache/flink/pull/2999
[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent
lost files/directories when called concurrently
This is mainly relevant to tests and Local Mini Cluster executions.
The `FileOutputFormat` and its subclasses rely on
`FileSystem::initOutPathLocalFS()` to prepare the output directory. When
multiple parallel output writers call that method, there is a slim chance that
one parallel threads deletes the others directory. The checks that the method
has are not bullet proof.
I believe that this is the cause for many Travis test instabilities that we
observed over time.
Simply synchronizing that method per process should do the trick. Since it
is a rare initialization method, and only relevant in tests & local mini
cluster executions, it should be a price that is okay to pay. I see no other
way, as we do not have simple access to an atomic "check and delete and
recreate" file operation.
The synchronization also makes many "re-try" code paths obsolete (there
should be no re-tries needed on proper file systems).
### Tests
This is tricky to test. The test in `InitOutputPathTest.java` uses a series
of latch to re-produce the problematic thread execution interleaving to
validate the problem. The properly fixed variant cannot use that interleaving
(because it fixes the problem, duh), but pushes the thread interleaving
best-effort towards the case where the problem would occur, were the method not
properly synchronized. Sounds weird, I know.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StephanEwen/incubator-flink fs_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2999.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2999
----
commit 7a4d5b2f53e3b8c27a5224e405cf1b671f474a58
Author: Stephan Ewen <[email protected]>
Date: 2016-12-13T18:15:11Z
[tests] Add 'CheckedThread' as a common test utility
commit e25b436f1de18be0dc2c3b02b82bf0b8203f0b44
Author: Stephan Ewen <[email protected]>
Date: 2016-12-13T18:12:12Z
[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent
lost files when called concurrently.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---