[jira] [Commented] (BEAM-3272) ParDoTranslatorTest: Error creating local cluster while creating checkpoint file

2018-02-06 Thread Thomas Weise (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354292#comment-16354292
 ] 

Thomas Weise commented on BEAM-3272:


If parallelism is enabled in Gradle, that could be the cause. The tests 
currently do not create unique directories, so they cannot be run in parallel. 
They also assume that 'target' is the build directory, which may cause further 
problems when the tests are run from Gradle.
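
As a rough illustration of the "unique directories" point, here is a minimal
sketch using only the JDK. The class and method names are hypothetical, and how
such a directory would be passed to the Apex runner is not shown because the
thread does not say whether the runner supports it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Beam or Apex.
class UniqueCheckpointDirs {

    // Creates a fresh directory under java.io.tmpdir for each call, so
    // concurrently running tests never share (or delete) each other's files.
    static Path newCheckpointDir(String testName) {
        try {
            return Files.createTempDirectory("apex-checkpoints-" + testName + "-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two calls for the same test name still yield two distinct directories.
        Path first = newCheckpointDir("testAssertionFailure");
        Path second = newCheckpointDir("testAssertionFailure");
        System.out.println(first);
        System.out.println(second);
    }
}
```

This also avoids hard-coding 'target' as the build directory. JUnit's
TemporaryFolder rule, mentioned in the issue description, would serve the same
purpose and additionally clean up after each test.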

> ParDoTranslatorTest: Error creating local cluster while creating checkpoint file
> 
>
> Key: BEAM-3272
> URL: https://issues.apache.org/jira/browse/BEAM-3272
> Project: Beam
> Issue Type: Bug
> Components: runner-apex
> Reporter: Eugene Kirpichov
> Assignee: Kenneth Knowles
> Priority: Critical
> Labels: flake
> Time Spent: 20m
>  Remaining Estimate: 0h
>
> Failed build: 
> https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/org.apache.beam$beam-runners-apex/5330/console
> Key output:
> {code}
> 2017-11-29T01:21:26.956 [ERROR] testAssertionFailure(org.apache.beam.runners.apex.translation.ParDoTranslatorTest)  Time elapsed: 2.007 s  <<< ERROR!
> java.lang.RuntimeException: Error creating local cluster
>   at org.apache.apex.engine.EmbeddedAppLauncherImpl.getController(EmbeddedAppLauncherImpl.java:122)
>   at org.apache.apex.engine.EmbeddedAppLauncherImpl.launchApp(EmbeddedAppLauncherImpl.java:71)
>   at org.apache.apex.engine.EmbeddedAppLauncherImpl.launchApp(EmbeddedAppLauncherImpl.java:46)
>   at org.apache.beam.runners.apex.ApexRunner.run(ApexRunner.java:197)
>   at org.apache.beam.runners.apex.TestApexRunner.run(TestApexRunner.java:57)
>   at org.apache.beam.runners.apex.TestApexRunner.run(TestApexRunner.java:31)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:304)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:290)
>   at org.apache.beam.runners.apex.translation.ParDoTranslatorTest.runExpectingAssertionFailure(ParDoTranslatorTest.java:156)
> {code}
> ...
> {code}
> Caused by: ExitCodeException exitCode=1: chmod: cannot access ‘/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Java_MavenInstall/src/runners/apex/target/com.datatorrent.stram.StramLocalCluster/checkpoints/2/_tmp’: No such file or directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
>   at org.apache.hadoop.util.Shell.run(Shell.java:479)
>   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
>   at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
>   at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
>   at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
>   at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
>   at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
>   at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
>   at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1017)
>   at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:99)
>   at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:352)
>   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:399)
>   at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:584)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:686)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:682)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.create(FileContext.java:688)
>   at com.datatorrent.common.util.AsyncFSStorageAgent.copyToHDFS(AsyncFSStorageAgent.java:119)
>   ... 50 more
> {code}
> Inspecting the code at these stack frames, it seems to be copying an 
> operator's checkpoint "to HDFS" (which in this case is the local disk), but 
> it fails while creating the target file of the copy: creation creates the 
> file (successfully) and then chmods it writable (unsuccessfully). Barring 
> something subtle (e.g. chmod not being allowed immediately after creating a 
> FileOutputStream), it looks like the whole directory may have been deleted 
> out from under the process. I don't know why that would happen, though, or 
> how to debug it.
> Either way, the path being accessed is funky: 
> 

[jira] [Commented] (BEAM-3272) ParDoTranslatorTest: Error creating local cluster while creating checkpoint file

2018-02-06 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354263#comment-16354263
 ] 

Kenneth Knowles commented on BEAM-3272:
---

It is worse under Gradle, perhaps because of parallelism and/or Gradle's 
tighter management of directories that it considers its own.
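
To make the suspected failure mode concrete, here is a small sketch that
replays it deterministically on a POSIX file system: a file is created in a
shared directory, the directory is then removed (standing in for another test
or the build tool cleaning up), and the subsequent permission change fails
because the file is gone. The class and method names are hypothetical, and the
sketch uses java.nio rather than the shell chmod call that appears in the
Hadoop stack trace, so it only approximates what happens in the real test.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical illustration, not code from Beam, Apex, or Hadoop.
class SharedDirRace {

    // Recursively deletes a directory tree, children before parents,
    // as a competing cleanup step might.
    static void deleteTree(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if changing permissions fails with "no such file"
    // after the shared directory has been removed.
    static boolean chmodFailsAfterCleanup() {
        try {
            Path shared = Files.createTempDirectory("shared-checkpoints-");
            Path tmp = Files.createDirectories(shared.resolve("2")).resolve("_tmp");
            Files.createFile(tmp);   // step 1: file creation succeeds
            deleteTree(shared);      // step 2: someone else removes the directory
            try {
                // step 3: the permission change now targets a missing path
                Files.setPosixFilePermissions(tmp, PosixFilePermissions.fromString("rw-r--r--"));
                return false;
            } catch (NoSuchFileException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("chmod failed after cleanup: " + chmodFailsAfterCleanup());
    }
}
```

In a real parallel run the deletion would come from a different thread or
process at an unpredictable moment, which would be consistent with the test
being flaky rather than failing every time.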

[jira] [Commented] (BEAM-3272) ParDoTranslatorTest: Error creating local cluster while creating checkpoint file

2018-02-06 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354261#comment-16354261
 ] 

Kenneth Knowles commented on BEAM-3272:
---

This is happening quite a lot. I am going to sickbay the test for now.

> ParDoTranslatorTest: Error creating local cluster while creating checkpoint file
> 
>
> Key: BEAM-3272
> URL: https://issues.apache.org/jira/browse/BEAM-3272
> Project: Beam
> Issue Type: Bug
> Components: runner-apex
> Reporter: Eugene Kirpichov
> Assignee: Kenneth Knowles
> Priority: Critical
> Labels: flake
>
> Failed build: 
> https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/org.apache.beam$beam-runners-apex/5330/console
> Key output:
> {code}
> 2017-11-29T01:21:26.956 [ERROR] testAssertionFailure(org.apache.beam.runners.apex.translation.ParDoTranslatorTest)  Time elapsed: 2.007 s  <<< ERROR!
> java.lang.RuntimeException: Error creating local cluster
>   at org.apache.apex.engine.EmbeddedAppLauncherImpl.getController(EmbeddedAppLauncherImpl.java:122)
>   at org.apache.apex.engine.EmbeddedAppLauncherImpl.launchApp(EmbeddedAppLauncherImpl.java:71)
>   at org.apache.apex.engine.EmbeddedAppLauncherImpl.launchApp(EmbeddedAppLauncherImpl.java:46)
>   at org.apache.beam.runners.apex.ApexRunner.run(ApexRunner.java:197)
>   at org.apache.beam.runners.apex.TestApexRunner.run(TestApexRunner.java:57)
>   at org.apache.beam.runners.apex.TestApexRunner.run(TestApexRunner.java:31)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:304)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:290)
>   at org.apache.beam.runners.apex.translation.ParDoTranslatorTest.runExpectingAssertionFailure(ParDoTranslatorTest.java:156)
> {code}
> ...
> {code}
> Caused by: ExitCodeException exitCode=1: chmod: cannot access ‘/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Java_MavenInstall/src/runners/apex/target/com.datatorrent.stram.StramLocalCluster/checkpoints/2/_tmp’: No such file or directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
>   at org.apache.hadoop.util.Shell.run(Shell.java:479)
>   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
>   at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
>   at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
>   at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
>   at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
>   at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
>   at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
>   at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1017)
>   at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:99)
>   at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:352)
>   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:399)
>   at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:584)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:686)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:682)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.create(FileContext.java:688)
>   at com.datatorrent.common.util.AsyncFSStorageAgent.copyToHDFS(AsyncFSStorageAgent.java:119)
>   ... 50 more
> {code}
> Inspecting the code at these stack frames, it seems to be copying an 
> operator's checkpoint "to HDFS" (which in this case is the local disk), but 
> it fails while creating the target file of the copy: creation creates the 
> file (successfully) and then chmods it writable (unsuccessfully). Barring 
> something subtle (e.g. chmod not being allowed immediately after creating a 
> FileOutputStream), it looks like the whole directory may have been deleted 
> out from under the process. I don't know why that would happen, though, or 
> how to debug it.
> Either way, the path being accessed is funky: 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Java_MavenInstall/src/runners/apex/target/...
>  - I think it'd be better if this test used a "@Rule TemporaryFolder" to 
> store Apex checkpoints. I don't know whether the Apex runner allows that, but 
> I can see