[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-09-15 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196111#comment-17196111
 ] 

Chesnay Schepler commented on FLINK-19005:
--

[~gestevez] From what I can tell, the dump looks OK to me. The 
ChildFirstClassLoaders are still around, but hold far fewer references, and will 
likely be gc'd at a later point. In one of the loaders there are some lingering 
threads, but given that these are not present in other loaders I'd assume that 
they will shut down at a later point.

I will add a note to the classloading-debugging documentation about JDBC.

> used metaspace grow on every execution
> --
>
> Key: FLINK-19005
> URL: https://issues.apache.org/jira/browse/FLINK-19005
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission, Runtime / Configuration, 
> Runtime / Coordination
>Affects Versions: 1.11.1
>Reporter: Guillermo Sánchez
>Assignee: Chesnay Schepler
>Priority: Major
> Attachments: heap_dump_after_10_executions.zip, 
> heap_dump_after_1_execution.zip, heap_dump_echo_lee.tar.xz, 
> heapdump-after_metaspace_dropped.hprof, modified-jdbc-inputformat.png, 
> origin-jdbc-inputformat.png
>
>
> Hi!
> I'm running a Flink 1.11.1 cluster, where I execute batch jobs written with 
> the DataSet API.
> I submit these jobs every day to calculate daily data.
> With every execution, the cluster's used metaspace increases by 7MB and is 
> never released.
> This ends up in an OutOfMemoryError caused by Metaspace every 15 days, and I 
> need to restart the cluster to clean up the metaspace.
> taskmanager.memory.jvm-metaspace.size is set to 512mb.
> Any idea what could be causing this metaspace growth and why it is not 
> released?
>  
> 
> === Summary ==
> 
> Case 1, reported by [~gestevez]:
> * Flink 1.11.1
> * Java 11
> * Maximum Metaspace size set to 512mb
> * Custom Batch job, submitted daily
> * Requires restart every 15 days after an OOM
>  Case 2, reported by [~Echo Lee]:
> * Flink 1.11.0
> * Java 11
> * G1GC
> * WordCount Batch job, submitted every second / every 5 minutes
> * eventually fails TaskExecutor with OOM
> Case 3, reported by [~DaDaShen]
> * Flink 1.11.0
> * Java 11
> * WordCount Batch job, submitted every 5 seconds
> * growing Metaspace, eventually OOM
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-09-01 Thread Jira


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188456#comment-17188456
 ] 

Guillermo Sánchez commented on FLINK-19005:
---

Thanks [~chesnay], [~trohrmann]. I tried your solution and changed the scope of 
the Oracle library dependency in my job's pom.xml to provided. It now consumes 
much less metaspace and creates fewer classes on every execution, and the used 
metaspace finally drops some hours after the executions have finished.

I have uploaded the [^heapdump-after_metaspace_dropped.hprof] file, generated 
after the metaspace had dropped, following 6 executions of my batch job.
Could you examine it to check whether there is still a memory leak or whether 
it is completely solved?



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-31 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187492#comment-17187492
 ] 

Till Rohrmann commented on FLINK-19005:
---

The per-job mode also works in standalone mode via {{standalone-job.sh}}, but 
it won't stop the started {{TaskManagers}}. Hence it is easier to use when 
running on Yarn or K8s.

At the moment I think Chesnay's suggestion to put the dependency into {{/lib}} 
is the best option to work around the problem. Once we have FLINK-17554, we 
might also solve this problem in a more convenient way.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186472#comment-17186472
 ] 

João Boto commented on FLINK-19005:
---

[~trohrmann] can that be applied to a standalone cluster? ([~DaDaShen] and 
[~gestevez] are using a standalone cluster)

It seems that Flink is not clearing the JDBC references. If we only use 
TableEnvironment or InputFormat, we only provide the JDBC library; all 
interaction with the library is done by Flink, so the cleanup must also be done 
by Flink, as we don't have access to it.

I will help [~gestevez] try [~chesnay]'s option of adding the drivers to /lib 
instead of bundling them in the user-jar.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-28 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186389#comment-17186389
 ] 

Chesnay Schepler commented on FLINK-19005:
--

I don't think anyone has a plan to solve it. I can offer a potential workaround 
though:

What you could try is putting the jdbc drivers into /lib instead of bundling 
them in the user-jar, and adding them to the list of [parent-first loaded 
classes|https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#classloader-parent-first-patterns-additional]. 
That should ensure that the driver is only created once, outside the 
user-classloader.
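
To illustrate what "parent-first loaded" means here, the following is a minimal sketch, assuming the patterns configured under {{classloader.parent-first-patterns.additional}} are matched as simple prefixes of the fully qualified class name. The class name {{ParentFirstPatterns}} and the example prefixes are hypothetical and are not taken from Flink's code or from this issue; use the package prefixes of the drivers you actually place in /lib.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of prefix-based parent-first resolution; not Flink's actual implementation. */
class ParentFirstPatterns {

    // Hypothetical value for classloader.parent-first-patterns.additional.
    static final List<String> ADDITIONAL_PATTERNS =
            Arrays.asList("org.postgresql.", "oracle.jdbc.");

    /** Returns true if the class name starts with one of the parent-first prefixes. */
    static boolean isParentFirst(String className, List<String> patterns) {
        for (String pattern : patterns) {
            if (className.startsWith(pattern)) {
                return true;
            }
        }
        return false;
    }
}
```

With these example prefixes, a class such as {{org.postgresql.Driver}} would be resolved through the parent classloader, while job classes outside the listed packages would still be loaded child-first from the user-jar.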



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-28 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186384#comment-17186384
 ] 

Till Rohrmann commented on FLINK-19005:
---

One thing you can do is to use Flink's per-job mode deployment. That way you 
will create a new cluster for every job and terminate the cluster once the job 
is done, so you won't run into the problem of accumulating class loader 
leaks.

If the JDBC classes are strictly loaded by the user code class loader, then 
FLINK-17554 might help by allowing shutdown hooks to be registered for every 
user code class loader.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-28 Thread ShenDa (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186353#comment-17186353
 ] 

ShenDa commented on FLINK-19005:


[~chesnay] 
It looks like this problem is hard to solve.
Do we have any plan or good idea to fix it? Or is there any other place to 
discuss the problem?



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-28 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186325#comment-17186325
 ] 

Chesnay Schepler commented on FLINK-19005:
--

Well of course that works; you clean _all drivers,_ then things can get GC'd.

However, you also mess with any other Task from the same job running in that 
JVM that requires JDBC at the same time, because all of a sudden you are 
de-registering drivers, and there is unfortunately no guarantee that they will 
be re-registered when the class is reloaded (because JDBC is a mess).

Additionally, you are also de-registering all drivers that were loaded by the 
SystemClassLoader, which presumably should be left alone.
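
One way to narrow the cleanup so that it avoids the second concern is to deregister only the drivers whose class was loaded by one specific classloader. The following is a minimal sketch of that idea, not a proposed Flink patch: the class {{ScopedDriverCleanup}} and its method are hypothetical names, and it assumes the calling code is allowed to see the drivers in question (DriverManager only returns, and only lets you deregister, drivers visible to the caller's classloader). It does not address the first concern, since other Tasks of the same job share the user classloader.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: deregisters drivers that belong to a single class loader. */
class ScopedDriverCleanup {

    /**
     * Deregisters only the drivers whose class was loaded by the given class loader
     * and returns the class names of the drivers that were removed.
     */
    static List<String> deregisterDriversLoadedBy(ClassLoader loader) {
        List<String> removed = new ArrayList<>();
        // Copy the enumeration first so that deregistering does not interfere with iteration.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == loader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed.add(driver.getClass().getName());
                } catch (SQLException e) {
                    throw new IllegalStateException("Could not deregister " + driver, e);
                }
            }
        }
        return removed;
    }
}
```

Drivers registered through any other classloader, including the SystemClassLoader, are left untouched by this sketch.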



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-27 Thread ShenDa (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186221#comment-17186221
 ] 

ShenDa commented on FLINK-19005:


[~chesnay] 
Thanks for your detailed instructions. 
But I still think there may be something wrong in Flink. I find that 
JdbcInputFormat and JdbcOutputFormat are the key reason for the Metaspace OOM, 
because java.sql.DriverManager doesn't release its reference to the Driver. 
The DriverManager is loaded by a JDK-internal class loader, but the driver is 
loaded by the ChildFirstClassLoader, which means the ChildFirstClassLoader 
can't be garbage collected, according to my analysis of the dump file.  
I use the following code to reproduce the issue, with org.postgresql.Driver as 
the JDBC driver.
{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.TableResult;

// INPUT_TABLE, DB_URL, USERNAME and PASSWORD are constants defined elsewhere in the class.
public static void main(String[] args) throws Exception {
    EnvironmentSettings envSettings = EnvironmentSettings.newInstance()
            .useBlinkPlanner()
            .inBatchMode()
            .build();
    TableEnvironment tEnv = TableEnvironment.create(envSettings);

    // Register a JDBC-backed source table.
    tEnv.executeSql(
            "CREATE TABLE " + INPUT_TABLE + "(" +
                    "id BIGINT," +
                    "timestamp6_col TIMESTAMP(6)," +
                    "timestamp9_col TIMESTAMP(6)," +
                    "time_col TIME," +
                    "real_col FLOAT," +
                    "decimal_col DECIMAL(10, 4)" +
                    ") WITH (" +
                    "  'connector.type'='jdbc'," +
                    "  'connector.url'='" + DB_URL + "'," +
                    "  'connector.table'='" + INPUT_TABLE + "'," +
                    "  'connector.username'='" + USERNAME + "'," +
                    "  'connector.password'='" + PASSWORD + "'" +
                    ")"
    );

    // Query the table and fetch the results.
    TableResult tableResult = tEnv.executeSql(
            "SELECT timestamp6_col, decimal_col FROM " + INPUT_TABLE);
    tableResult.collect();
}
{code}
The diagram below shows the Metaspace usage growing constantly; eventually the 
TaskManager goes offline.
 !origin-jdbc-inputformat.png! 


Additionally, I tried to fix this issue by appending the following code to the 
function closeInputFormat(), which finally triggers garbage collection in 
Metaspace.

{code:java}
try {
    // Deregister all drivers returned by DriverManager.
    final Enumeration<Driver> drivers = DriverManager.getDrivers();
    while (drivers.hasMoreElements()) {
        DriverManager.deregisterDriver(drivers.nextElement());
    }
} catch (SQLException se) {
    LOG.info("Inputformat couldn't be closed - " + se.getMessage());
}
{code}
The following diagram shows that the Metaspace usage then decreases.
 !modified-jdbc-inputformat.png! 


[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-25 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183942#comment-17183942
 ] 

Chesnay Schepler commented on FLINK-19005:
--

[~DaDaShen]
 * load the heap dump in Eclipse MAT
 * create a histogram
 * group classes by classloader
 * among others we can see several ChildFirstClassLoader objects
 ** these are the user classloaders
 ** because they are still around, something is leaking them
 * select one of these entries, and merge the shortest paths to GC roots
 * there is now one entry for the system classloader
 * drilling down into it we find the {{java.sql.DriverManager}}
 ** the contained registeredDrivers array contains multiple drivers for druid, 
postgresql and calcite
 * select any of these drivers, use Java Basics -> Class Loader explorer
 * you are now shown a ChildFirstClassLoader

This means that the driver originates from the user classloader, but is 
referenced from the system classloader. If the reference in the latter is not 
removed (due to improper cleanup), then the user classloader cannot be garbage 
collected.
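
The same relationship can also be inspected at runtime, without a heap dump. The following is a minimal sketch under stated assumptions: {{DriverLoaderReport}} is a hypothetical helper, not part of Flink, and because DriverManager only returns drivers visible to the caller's classloader, it would have to run inside user code to see drivers registered through the user classloader.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists registered JDBC drivers and the class loader of each. */
class DriverLoaderReport {

    /** Returns one line per visible driver: "<driver class> loaded by <class loader class>". */
    static List<String> describeRegisteredDrivers() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            ClassLoader loader = driver.getClass().getClassLoader();
            // A null loader means the class was loaded by the bootstrap class loader.
            String loaderName = (loader == null) ? "bootstrap" : loader.getClass().getName();
            lines.add(driver.getClass().getName() + " loaded by " + loaderName);
        }
        return lines;
    }
}
```

A line that names a ChildFirstClassLoader would indicate a driver that originates from the user classloader while still being referenced by the DriverManager.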



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-25 Thread ShenDa (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183928#comment-17183928
 ] 

ShenDa commented on FLINK-19005:


[~chesnay] I'd like to know how you determined from the dump files that the 
class leak is caused by java.sql.DriverManager. I still have no idea how to 
locate the root problem. 
BTW, I tried several times to reproduce the metaspace OOM with the WordCount 
job, but this time Flink ran fine and no metaspace OOM occurred, so it was my 
mistake.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-25 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183906#comment-17183906
 ] 

Chesnay Schepler commented on FLINK-19005:
--

[~Echo Lee] The JMX thingie is probably fine since it uses a {{WeakHashMap}}, 
but there is the {{java.sql.DriverManager}} leaking classes. This is a 
relatively well-known issue for JDBC.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-25 Thread Echo Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183904#comment-17183904
 ] 

Echo Lee commented on FLINK-19005:
--

Yes, I turned on JMX in order to monitor the changes of Metaspace through 
JavaVisual.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-25 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183894#comment-17183894
 ] 

Chesnay Schepler commented on FLINK-19005:
--

[~Echo Lee] This looks like a similar problem: the user classloader is being 
leaked. Are you using JMX in any way? I see various classes from within the 
user classloader being referenced from the outside by a 
{{StandardMBeanIntrospector}}.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-25 Thread Echo Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183887#comment-17183887
 ] 

Echo Lee commented on FLINK-19005:
--

[~chesnay] The issue is hard to reproduce with WordCount.jar, but easy to 
reproduce with our own application. I have attached the heap dump. Please help 
analyze the cause. Thank you!

The execution environment is as follows:
 * Flink 1.11.0
 * openjdk 11.0.2 (2019-01-15)
 * G1GC
 * Max Metaspace Size 90m
 * Metaspace Size 80m
 * Batch job, submitted every 15 seconds
 * eventually fails TaskExecutor with OOM

The heap dump consists of two files: first.bin is the heap dump taken after 
running the job once, and fourth.bin is the heap dump taken after running the 
job four times.

[^heap_dump_echo_lee.tar.xz]



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-24 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183623#comment-17183623
 ] 

Chesnay Schepler commented on FLINK-19005:
--

My conclusion is that Flink is not leaking anything, and the errors are due to 
unfortunate timings or some JDK issue.

I was able to reproduce the issue when submitting jobs directly one after 
another or with 5 seconds in between, but after increasing the backoff to 1 
minute the OOM no longer occurred. The GC stats also showed that the Metaspace 
usage did not continuously increase; the GC created distinct dips that 
frequently managed to match or even undercut prior dips.

[Stephan's 
comment|https://issues.apache.org/jira/browse/FLINK-16408?focusedCommentId=17180577=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17180577]
 appears to apply here, at the very least for all the mentioned cases where 
WordCounts are frequently run.

As for the original issue by [~gestevez], this looks like a clear case of 
classloaders being leaked. There are (at least) a bunch of 
{{oracle.jdbc.driver.BlockSource.ThreadedCachingBlockSource.BlockReleaser}} 
threads hanging around preventing the garbage collection.
So technically, this is a thread leak inherent to this library or caused by 
improper usage.
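
One way to check for lingering threads like these without opening a heap dump is to count matching entries in a thread dump. The sketch below assumes the dump comes from the JDK's {{jstack}} tool, that thread entries are separated by blank lines, and that the leaked threads can be recognised by a substring such as {{BlockReleaser}} somewhere in their entry. The helper name {{count_threads}} is made up for illustration.

```shell
# count_threads PATTERN
# Reads a thread dump on stdin and prints how many thread entries
# (blank-line separated blocks) contain PATTERN as a literal substring.
count_threads() {
  awk -v pat="$1" '
    BEGIN { RS = ""; n = 0 }      # RS = "" switches awk to paragraph mode
    index($0, pat) { n++ }        # literal match, no regex interpretation
    END { print n }
  '
}
```

Possible usage: {{jstack <taskmanager-pid> | count_threads BlockReleaser}}, run after each job. If the number keeps growing, the threads outlive the job and are likely keeping its classloader reachable.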



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-24 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183217#comment-17183217
 ] 

Chesnay Schepler commented on FLINK-19005:
--

Small update: the TaskExecutor crashed while things ran in the background, 
after around 2200 jobs. I'm trying to figure out what exactly happened; the 
logs don't contain an exception, but maybe it happened in some component that 
logs it on DEBUG level.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-24 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183182#comment-17183182
 ] 

Chesnay Schepler commented on FLINK-19005:
--

I've added a summary to the Jira description.

[~Echo Lee] [~DaDaShen] [~gestevez] Could you tell us exactly which Java 11 
version you are using? (output of {{java -version}})

So far I was not really able to reproduce the issue. I'm submitting the Batch 
Wordcount example in a loop, such that the next one is submitted once the 
previous one finishes. I do see the Metaspace going up, but once it gets close 
to the Metaspace size maximum the GC kicks in. My current run has been going 
for about an hour and ran roughly 1500 jobs, and I can see several dips in the 
metaspace usage.

In one instance I did get a TaskExecutor crash of sorts, but increasing the 
Metaspace by as little as 2mb fixed this issue. I do not consider this a 
successful reproduction, as I've conducted the tests initially with a very low 
max size of 40mb, and it is somewhat expected that things may fail when it is 
configured to such a low value.
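
For anyone who wants to repeat this kind of test, a submission loop along these lines should do. It is only a sketch: the helper name {{submit_loop}} is invented here, and the command to run is passed in by the caller, so the Flink CLI path and example jar location in the usage note are assumptions about a local distribution.

```shell
# submit_loop COUNT PAUSE_SECONDS CMD [ARGS...]
# Runs CMD COUNT times, one after another, sleeping PAUSE_SECONDS between
# runs. Stops with status 1 as soon as one run fails.
submit_loop() {
  count=$1
  pause=$2
  shift 2
  i=1
  while [ "$i" -le "$count" ]; do
    if ! "$@"; then
      echo "submission $i failed" >&2
      return 1
    fi
    # No pause after the final run.
    if [ "$i" -lt "$count" ]; then
      sleep "$pause"
    fi
    i=$((i + 1))
  done
}
```

Assuming the current directory is a Flink distribution, {{submit_loop 1500 0 ./bin/flink run ./examples/batch/WordCount.jar}} would submit the next job as soon as the previous one finishes, and a pause of 5 or 60 instead of 0 would add a gap between submissions.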



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-24 Thread Echo Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182968#comment-17182968
 ] 

Echo Lee commented on FLINK-19005:
--

[~chesnay] Is there any new progress on this issue? 



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181871#comment-17181871
 ] 

Matthias commented on FLINK-19005:
--

Thanks for the update.
A quick diff already shows a growing number of classes generated through 
reflection:
{code:bash}
# analysis on the "after 1 execution" heap dump
FLINK-19005 grep -o "class .*" after_1/out.html | sed -e 's~class ~~g' -e 's~~~g' | cut -d'$' -f1 | sed 's/[0-9]*$//g'| sort | uniq -c | sort -rn | head
 287 jdk.internal.reflect.GeneratedSerializationConstructorAccessor
 150 jdk.internal.reflect.GeneratedMethodAccessor
 136 oracle.jdbc.driver.Redirector
  41 org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache
  37 akka.remote.WireFormats
  37 akka.remote.EndpointManager
  35 org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.cache.LocalCache
  35 akka.remote.RemoteSettings
  33 org.apache.flink.shaded.hadoop2.com.google.common.cache.LocalCache
  31 akka.remote.serialization.MiscMessageSerializer
{code}
{code:bash}
# analysis on the "after 10 executions" heap dump
FLINK-19005 grep -o "class .*" after_10/out.html | sed -e 's~class ~~g' -e 's~~~g' | cut -d'$' -f1 | sed 's/[0-9]*$//g'| sort | uniq -c | sort -rn | head
 575 jdk.internal.reflect.GeneratedSerializationConstructorAccessor
 223 jdk.internal.reflect.GeneratedMethodAccessor
 136 oracle.jdbc.driver.Redirector
  49 com.sun.proxy.
  41 org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache
  37 akka.remote.WireFormats
  37 akka.remote.EndpointManager
  36 jdk.internal.reflect.GeneratedConstructorAccessor
  35 org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.cache.LocalCache
  35 akka.remote.RemoteSettings
{code}
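
The two listings can also be compared mechanically instead of by eye. The following is a rough sketch and not the exact commands used above: it assumes each input file is plain text with lines of the form {{class <name>}} (so the HTML-stripping step is left out), and the function names {{count_classes}} and {{class_growth}} are made up for this example.

```shell
# count_classes FILE
# Prints "<count> <name>" per class-name prefix, where the prefix is the
# class name cut at the first '$' and with trailing digits removed.
count_classes() {
  grep -o 'class .*' "$1" \
    | sed -e 's~class ~~' \
    | cut -d'$' -f1 \
    | sed 's/[0-9]*$//' \
    | sort | uniq -c | sort -rn
}

# class_growth BEFORE AFTER
# Prints "<delta> <name>" for every prefix that occurs more often in AFTER
# than in BEFORE, largest increase first.
class_growth() {
  {
    count_classes "$1" | awk '{ print "a", $1, $2 }'
    count_classes "$2" | awk '{ print "b", $1, $2 }'
  } | awk '
    $1 == "a" { before[$3] = $2 }
    $1 == "b" { after[$3] = $2 }
    END {
      for (name in after) {
        delta = after[name] - before[name]   # missing entries count as 0
        if (delta > 0) print delta, name
      }
    }
  ' | sort -rn
}
```

Called as {{class_growth after_1.txt after_10.txt}} (hypothetical file names), it lists only the prefixes whose count went up between the two dumps.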



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181864#comment-17181864
 ] 

Guillermo Sánchez commented on FLINK-19005:
---

I have uploaded the fixed heap_dump_after_1_executions.zip.

Thanks



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181848#comment-17181848
 ] 

Matthias commented on FLINK-19005:
--

Hi [~gestevez], thanks for getting back to us. It looks like 
{{heap_dump_after_10_executions.zip}} contains the 
{{heap_dump_after_1_execution.zip}}. Could you update the 
{{heap_dump_after_10_executions.zip}} attachment?



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181810#comment-17181810
 ] 

Guillermo Sánchez commented on FLINK-19005:
---

[~chesnay], [~mapohl], I have attached the heap dump files 
[^heap_dump_after_1_execution.zip] and [^heap_dump_after_10_executions.zip], in 
case you can analyze them.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181754#comment-17181754
 ] 

Guillermo Sánchez commented on FLINK-19005:
---

It's the same problem that I have.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread ShenDa (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181741#comment-17181741
 ] 

ShenDa commented on FLINK-19005:


[~chesnay] I didn't do this on a local cluster. I use a script to submit a job 
every 5 seconds on a standalone cluster, so I don't know how many executions 
it takes to trigger the OOM. With the default Metaspace configuration it takes 
a long time to occur. But if you observe the Metaspace usage, you'll find that 
the space is never released and grows continuously.
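
For reference, one way to observe this from the command line is the JDK's {{jstat -gc}}, which reports Metaspace usage in its {{MU}} column (in KB). The small filter below is only a sketch: the name {{metaspace_used}} is made up, and it assumes the first input line is the {{jstat}} header row.

```shell
# metaspace_used
# Reads `jstat -gc` output on stdin and prints only the MU (Metaspace used,
# KB) value of every sample line. The column is located by its header name,
# so the position of MU in the output does not matter.
metaspace_used() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "MU") col = i; next }
    col     { print $col }
  '
}
```

For example, {{jstat -gc <taskmanager-pid> 5000 | metaspace_used}} would print one value every 5 seconds. A leak shows up as values that keep climbing, while a healthy run shows periodic drops after GC.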



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181739#comment-17181739
 ] 

Chesnay Schepler commented on FLINK-19005:
--

[~DaDaShen] Does this happen on a local cluster? After how many jobs does it 
happen for you?



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Echo Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181737#comment-17181737
 ] 

Echo Lee commented on FLINK-19005:
--

[~chesnay] I also encountered this problem, which I raised in a comment on 
issue [FLINK-16408|https://issues.apache.org/jira/browse/FLINK-16408].



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread ShenDa (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181718#comment-17181718
 ] 

ShenDa commented on FLINK-19005:


[~chesnay] Yes, I use Flink 1.11.0 with its batch WordCount in a JDK 11 
environment.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181709#comment-17181709
 ] 

Chesnay Schepler commented on FLINK-19005:
--

[~DaDaShen] Are you also using 1.11.0? Are you submitting the batch WordCount?



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-21 Thread ShenDa (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181704#comment-17181704
 ] 

ShenDa commented on FLINK-19005:


[~mapohl] Hi Matthias,
I'm seeing the same problem as [~gestevez]. To make sure the Metaspace leak is 
not caused by my code, I specifically submitted the WordCount job. On every 
execution the Metaspace grows and is never released, until the OOM occurs.



[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

2020-08-20 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181205#comment-17181205
 ] 

Matthias commented on FLINK-19005:
--

Hi Robert,

this looks like a Metaspace leak. The memory used for the Metaspace pool is 
quite high compared to other use cases (e.g. in this [mailing list 
thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/MaxMetaspace-default-may-be-to-low-td33049.html#a33072]).

One proposal is to create a heap dump to check whether there are classes that 
are not cleaned up. This could be an indication of a Metaspace memory leak 
in either the user code or Flink. Would it be possible to analyse the heap dump 
yourself or provide it for an assisted analysis? In the latter case, I would 
propose moving the discussion to the [User Mailing 
List|https://flink.apache.org/community.html#mailing-lists].

Best,
Matthias
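
In case it helps, a heap dump of a running TaskManager can be taken with the JDK's {{jcmd}} tool and its {{GC.heap_dump}} command. The wrapper below is a minimal sketch: the function name {{dump_heap}} and the {{JCMD}} override variable are conventions invented for this example, not part of Flink or the JDK.

```shell
# dump_heap PID FILE
# Asks the JVM with process id PID to write a heap dump to FILE.
# Set JCMD to use a jcmd binary that is not on the PATH.
dump_heap() {
  pid=${1:-}
  out=${2:-}
  if [ -z "$pid" ] || [ -z "$out" ]; then
    echo "usage: dump_heap PID FILE" >&2
    return 2
  fi
  "${JCMD:-jcmd}" "$pid" GC.heap_dump "$out"
}
```

A call such as {{dump_heap <taskmanager-pid> /tmp/after_1.hprof}} after the first job and again after the tenth gives two dumps to compare. An absolute path is the safer choice, since a relative one is resolved by the target JVM rather than by the calling shell.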
 
