[ 
https://issues.apache.org/jira/browse/FLINK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706734#comment-17706734
 ] 

Matthias Pohl commented on FLINK-31278:
---------------------------------------

I had an offline discussion with Chesnay on that matter. He pointed out that 
the error is not coming directly from the JVM but from Docker itself. 
But even there, a 137 exit code usually means (based on historical data) that 
a memory issue is the cause. A solution we discussed was the following one:

The unit tests currently run in 4 JVMs in parallel within Docker. Docker 
failing with a 137 exit code probably means that the JVMs took up too much 
memory. Therefore, a possible way to provoke such an error again would be 
to increase the number of JVMs running in parallel without changing their 
memory setup.

But Chesnay had another point: Because the failure happens quite rarely, it 
could be caused by certain tests running at the same time. The 
previously proposed setup wouldn't cover this case. Alternatively, we could 
let a concurrent process run in Docker while we execute the JUnit tests in a 
single JVM. The custom process acquires a specific amount of memory which we 
can slowly increase. Any test that requires an unusual amount of memory 
should fail first. That might be a viable way to identify the memory issues 
that caused Docker to fail.
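
To make the second idea more concrete, here is a rough sketch of what such a 
concurrent process could look like. This is only an illustration under 
assumptions: the class name MemoryHog, the 64 MiB default step, the 30 s pause 
and the 4 KiB page size are all made up and nothing like this exists in the 
Flink code base. It would also need an -Xmx setting large enough for the 
amount we want it to hold, otherwise it ends with its own OutOfMemoryError 
before Docker reacts.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Flink): holds a slowly growing amount of heap memory. */
public class MemoryHog {
    // Assumed page size; writing one byte per page makes the memory resident.
    private static final int PAGE_SIZE = 4096;

    // Keeping references prevents the chunks from being garbage collected.
    private final List<byte[]> held = new ArrayList<>();

    /** Allocates one more chunk, touches each page, and returns the total bytes held. */
    public long grow(int chunkBytes) {
        byte[] chunk = new byte[chunkBytes];
        for (int i = 0; i < chunk.length; i += PAGE_SIZE) {
            chunk[i] = 1;
        }
        held.add(chunk);
        return heldBytes();
    }

    /** Returns the number of bytes currently held by this instance. */
    public long heldBytes() {
        long total = 0;
        for (byte[] chunk : held) {
            total += chunk.length;
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        // Step size in MiB (keep below 2048 to avoid int overflow) and pause between steps.
        int chunkMb = args.length > 0 ? Integer.parseInt(args[0]) : 64;
        long pauseMs = args.length > 1 ? Long.parseLong(args[1]) : 30_000L;

        MemoryHog hog = new MemoryHog();
        // Runs until it is killed from outside or the JVM heap limit is reached.
        while (true) {
            long total = hog.grow(chunkMb * 1024 * 1024);
            System.out.printf("holding %d MiB%n", total / (1024 * 1024));
            Thread.sleep(pauseMs);
        }
    }
}
```

The timestamps printed by this process could then be matched against the 
surefire log to see which test was running when the container got killed.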

> exit code 137 (i.e. OutOfMemoryError) in core module
> ----------------------------------------------------
>
>                 Key: FLINK-31278
>                 URL: https://issues.apache.org/jira/browse/FLINK-31278
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.0
>            Reporter: Matthias Pohl
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>
> The following build failed due to a 137 exit code indicating an 
> OutOfMemoryError:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46643&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=7847
> {code}
> [...]
> Mar 01 05:29:06 [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time 
> elapsed: 0.65 s - in 
> org.apache.flink.runtime.io.compression.BlockCompressionTest
> Mar 01 05:29:06 [INFO] Running 
> org.apache.flink.runtime.dispatcher.DispatcherCachedOperationsHandlerTest
> Mar 01 05:29:07 [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time 
> elapsed: 1.142 s - in 
> org.apache.flink.runtime.dispatcher.DispatcherCachedOperationsHandlerTest
> Mar 01 05:29:08 [INFO] Running 
> org.apache.flink.runtime.dispatcher.MemoryExecutionGraphInfoStoreTest
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 5953b171e8ed4caba7af2b326533e249211ed4dcc48640edb3c1b0cbbcdf1a21 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - core
> {code}
> This build ran on an Azure pipeline machine (Azure Pipelines 9) and, 
> therefore, cannot be caused by FLINK-18356. That said, there was a concurrent 
> 137 exit code build failure happening on agent "Azure Pipelines 21" (see 
> [20230301.3|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46643&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=7847])
>  ~10mins later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
