[
https://issues.apache.org/jira/browse/FLINK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746125#comment-15746125
]
ASF GitHub Bot commented on FLINK-5332:
---------------------------------------
GitHub user StephanEwen opened a pull request:
https://github.com/apache/flink/pull/2999
[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent
lost files/directories when called concurrently
This is mainly relevant to tests and Local Mini Cluster executions.
The `FileOutputFormat` and its subclasses rely on
`FileSystem::initOutPathLocalFS()` to prepare the output directory. When
multiple parallel output writers call that method, there is a slim chance that
one parallel threads deletes the others directory. The checks that the method
has are not bullet proof.
I believe that this is the cause for many Travis test instabilities that we
observed over time.
Simply synchronizing that method per process should do the trick. Since it
is a rare initialization method, and only relevant in tests & local mini
cluster executions, it should be a price that is okay to pay. I see no other
way, as we do not have simple access to an atomic "check and delete and
recreate" file operation.
The synchronization also makes many "re-try" code paths obsolete (there
should be no re-tries needed on proper file systems).
### Tests
This is tricky to test. The test in `InitOutputPathTest.java` uses a series
of latch to re-produce the problematic thread execution interleaving to
validate the problem. The properly fixed variant cannot use that interleaving
(because it fixes the problem, duh), but pushes the thread interleaving
best-effort towards the case where the problem would occur, were the method not
properly synchronized. Sounds weird, I know.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StephanEwen/incubator-flink fs_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2999.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2999
----
commit 7a4d5b2f53e3b8c27a5224e405cf1b671f474a58
Author: Stephan Ewen <[email protected]>
Date: 2016-12-13T18:15:11Z
[tests] Add 'CheckedThread' as a common test utility
commit e25b436f1de18be0dc2c3b02b82bf0b8203f0b44
Author: Stephan Ewen <[email protected]>
Date: 2016-12-13T18:12:12Z
[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent
lost files when called concurrently.
----
> Non-thread safe FileSystem::initOutPathLocalFS() can cause lost
> files/directories in local execution
> ----------------------------------------------------------------------------------------------------
>
> Key: FLINK-5332
> URL: https://issues.apache.org/jira/browse/FLINK-5332
> Project: Flink
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.2.0
> Reporter: Stephan Ewen
> Assignee: Stephan Ewen
> Priority: Critical
> Fix For: 1.2.0
>
>
> This is mainly relevant to tests and Local Mini Cluster executions.
> The {{FileOutputFormat}} and its subclasses rely on
> {{FileSystem::initOutPathLocalFS()}} to prepare the output directory. When
> multiple parallel output writers call that method, there is a slim chance
> that one parallel threads deletes the others directory. The checks that the
> method has are not bullet proof.
> I believe that this is the cause for many Travis test instabilities that we
> observed over time.
> Simply synchronizing that method per process should do the trick. Since it is
> a rare initialization method, and only relevant in tests & local mini cluster
> executions, it should be a price that is okay to pay. I see no other way, as
> we do not have simple access to an atomic "check and delete and recreate"
> file operation.
> The synchronization also makes many "re-try" code paths obsolete (there
> should be no re-tries needed on proper file systems).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)