[ 
https://issues.apache.org/jira/browse/FLINK-16636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088467#comment-17088467
 ] 

Caizhi Weng edited comment on FLINK-16636 at 4/21/20, 9:39 AM:
---------------------------------------------------------------

I've tried to run the IT cases on my local machine and read some surefire 
source code. Although I'm not 100% sure  about the reason of this problem, the 
following is my guessing:

Surefire plugin needs to periodically check whether the maven process (its 
parent process) is still running so that it can stop itself when the maven 
process unexpectedly exits (see {{processCheckerJob}} in [this 
code|https://github.com/apache/maven-surefire/blob/master/surefire-booter/src/main/java/org/apache/maven/surefire/booter/ForkedBooter.java]).
 To achieve this, it periodically calls 
[{{PpidChecker#isProcessAlive}}|https://github.com/apache/maven-surefire/blob/master/surefire-booter/src/main/java/org/apache/maven/surefire/booter/PpidChecker.java#L112]
 and finally goes into 
[{{unix}}|https://github.com/apache/maven-surefire/blob/master/surefire-booter/src/main/java/org/apache/maven/surefire/booter/PpidChecker.java#L163]
 method. In this method, it builds up a shell command with a {{ProcessBuilder}} 
and executes it with {{UNIXProcess.forkAndExec}} (see stack in [[~rmetzger]'s 
post|https://issues.apache.org/jira/browse/FLINK-16636?focusedCommentId=17087385&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17087385]).

It's true that we've limited each JVM with 2GB heap size (a total of 4GB heap 
size). But as the shell command is executed by forking a new process, when this 
surefire check happens, it is possible that the memory usage doubles, thus 
reaching the hard 8GB limit of the container.

There are some IT cases which uses a real lot of heap memory (especially 
{{HashAggITCase}}), but according to my local {{top}} command, currently the 
memory usage of all tests seems to be precisely a bit lower than what is needed 
to trigger a full GC (memory usage of each JVM goes up to about 2000MB). As 
{{TableEnvironmentITCase}} and {{TableUtilsStreamingITCase}} are the last two 
test cases to run, the memory failure is most likely to happen in their turn.

So it seems that the only thing that goes wrong is the heap limit we apply to 
the testing JVMs. I think lowering the heap limit a bit can solve this issue.


was (Author: tsreaper):
I've tried to run the IT cases on my local machine and read some surefire 
source code. Although I'm not 100% sure  about the reason of this problem, the 
following is my guessing:

Surefire plugin needs to periodically check whether the maven process (its 
parent process) is still running so that it can stop itself when the maven 
process unexpectedly exits (see {{processCheckerJob}} in [this 
code|https://github.com/apache/maven-surefire/blob/master/surefire-booter/src/main/java/org/apache/maven/surefire/booter/ForkedBooter.java]).
 To achieve this, it periodically calls 
[{{PpidChecker#isProcessAlive}}|https://github.com/apache/maven-surefire/blob/master/surefire-booter/src/main/java/org/apache/maven/surefire/booter/PpidChecker.java#L112]
 and finally goes into 
[{{unix}}|https://github.com/apache/maven-surefire/blob/master/surefire-booter/src/main/java/org/apache/maven/surefire/booter/PpidChecker.java#L163]
 method. In this method, it builds up a shell command with a {{ProcessBuilder}} 
and executes it with {{UNIXProcess.forkAndExec}}.

It's true that we've limited each JVM with 2GB heap size (a total of 4GB heap 
size). But as the shell command is executed by forking a new process, when this 
surefire check happens, it is possible that the memory usage doubles, thus 
reaching the hard 8GB limit of the container.

There are some IT cases which uses a real lot of heap memory (especially 
{{HashAggITCase}}), but according to my local {{top}} command, currently the 
memory usage of all tests seems to be precisely a bit lower than what is needed 
to trigger a full GC (memory usage of each JVM goes up to about 2000MB). As 
{{TableEnvironmentITCase}} and {{TableUtilsStreamingITCase}} are the last two 
test cases to run, the memory failure is most likely to happen in their turn.

So it seems that the only thing that goes wrong is the heap limit we apply to 
the testing JVMs. I think lowering the heap limit a bit can solve this issue.

> TableEnvironmentITCase is crashing on Travis
> --------------------------------------------
>
>                 Key: FLINK-16636
>                 URL: https://issues.apache.org/jira/browse/FLINK-16636
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Planner
>    Affects Versions: 1.11.0
>            Reporter: Jark Wu
>            Assignee: Caizhi Weng
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>             Fix For: 1.11.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Here is the instance and exception stack: 
> https://api.travis-ci.org/v3/job/663408376/log.txt
> But there is not too much helpful information there, maybe a accidental maven 
> problem.
> {code}
> 09:55:07.703 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test 
> (integration-tests) on project flink-table-planner-blink_2.11: There are test 
> failures.
> 09:55:07.703 [ERROR] 
> 09:55:07.703 [ERROR] Please refer to 
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire-reports
>  for the individual test results.
> 09:55:07.703 [ERROR] Please refer to dump files (if any exist) [date].dump, 
> [date]-jvmRun[N].dump and [date].dumpstream.
> 09:55:07.703 [ERROR] ExecutionException The forked VM terminated without 
> properly saying goodbye. VM crash or System.exit called?
> 09:55:07.703 [ERROR] Command was /bin/sh -c cd 
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target 
> && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m 
> -Dmvn.forkNumber=1 -XX:+UseG1GC -jar 
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire/surefirebooter714252487017838305.jar
>  
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire
>  2020-03-17T09-34-41_826-jvmRun1 surefire4625103637332937565tmp 
> surefire_43192129054983363633tmp
> 09:55:07.703 [ERROR] Error occurred in starting fork, check output in log
> 09:55:07.703 [ERROR] Process Exit Code: 137
> 09:55:07.703 [ERROR] Crashed tests:
> 09:55:07.703 [ERROR] org.apache.flink.table.api.TableEnvironmentITCase
> 09:55:07.703 [ERROR] 
> org.apache.maven.surefire.booter.SurefireBooterForkException: 
> ExecutionException The forked VM terminated without properly saying goodbye. 
> VM crash or System.exit called?
> 09:55:07.703 [ERROR] Command was /bin/sh -c cd 
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target 
> && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m 
> -Dmvn.forkNumber=1 -XX:+UseG1GC -jar 
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire/surefirebooter714252487017838305.jar
>  
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire
>  2020-03-17T09-34-41_826-jvmRun1 surefire4625103637332937565tmp 
> surefire_43192129054983363633tmp
> 09:55:07.703 [ERROR] Error occurred in starting fork, check output in log
> 09:55:07.703 [ERROR] Process Exit Code: 137
> 09:55:07.703 [ERROR] Crashed tests:
> 09:55:07.703 [ERROR] org.apache.flink.table.api.TableEnvironmentITCase
> 09:55:07.703 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:382)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:297)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
> 09:55:07.704 [ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
> 09:55:07.704 [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> 09:55:07.704 [ERROR] at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 09:55:07.704 [ERROR] at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 09:55:07.704 [ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
> 09:55:07.704 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> 09:55:07.704 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> 09:55:07.704 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> 09:55:07.704 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> 09:55:07.704 [ERROR] Caused by: 
> org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?
> 09:55:07.704 [ERROR] Command was /bin/sh -c cd 
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target 
> && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m 
> -Dmvn.forkNumber=1 -XX:+UseG1GC -jar 
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire/surefirebooter714252487017838305.jar
>  
> /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire
>  2020-03-17T09-34-41_826-jvmRun1 surefire4625103637332937565tmp 
> surefire_43192129054983363633tmp
> 09:55:07.704 [ERROR] Error occurred in starting fork, check output in log
> 09:55:07.704 [ERROR] Process Exit Code: 137
> 09:55:07.704 [ERROR] Crashed tests:
> 09:55:07.704 [ERROR] org.apache.flink.table.api.TableEnvironmentITCase
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:669)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$600(ForkStarter.java:115)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:371)
> 09:55:07.704 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:347)
> 09:55:07.704 [ERROR] at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 09:55:07.704 [ERROR] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 09:55:07.704 [ERROR] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 09:55:07.704 [ERROR] at java.lang.Thread.run(Thread.java:748)
> 09:55:07.704 [ERROR] -> [Help 1]
> 09:55:07.704 [ERROR] 
> 09:55:07.704 [ERROR] To see the full stack trace of the errors, re-run Maven 
> with the -e switch.
> 09:55:07.704 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> 09:55:07.704 [ERROR] 
> 09:55:07.704 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> 09:55:07.704 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> 09:55:07.704 [ERROR] 
> 09:55:07.704 [ERROR] After correcting the problems, you can resume the build 
> with the command
> 09:55:07.704 [ERROR]   mvn <goals> -rf :flink-table-planner-blink_2.11
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to