Hi folks,

When a Python word count job is submitted repeatedly to a Flink
session/standalone cluster, the metaspace usage of the task manager keeps
increasing (by about 40MB per submission). The reason is that the Beam
classes are loaded with the user class loader (child-first by default) in
Flink, and there are minor problems in the implementations of
`ProcessManager` (from Beam) and `ThreadPoolCache` (from Netty) that can
prevent the user class loader from being garbage collected even after the
job has finished, which eventually causes a metaspace memory leak. You can
refer to FLINK-15338[1] for more information.
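
To make the mechanism concrete, here is a small analogy in Python (not the
actual JVM code path): an object that is still referenced from a long-lived,
process-wide structure cannot be collected, even though the job that created
it is done. `UserClassLoader`, `STATIC_CACHE`, `run_job` and `is_collected`
are hypothetical names for illustration only; they are not Beam, Netty or
Flink APIs.

```python
import gc
import weakref


class UserClassLoader:
    """Hypothetical stand-in for the per-job user class loader."""


# Hypothetical stand-in for a process-wide static cache that outlives jobs
# (an assumption for illustration, not the real Beam or Netty structure).
STATIC_CACHE = []


def run_job(leak):
    """Simulate one job submission and return a weak reference to its loader."""
    loader = UserClassLoader()
    if leak:
        # The long-lived cache keeps a strong reference to the loader.
        STATIC_CACHE.append(loader)
    return weakref.ref(loader)


def is_collected(ref):
    """Return True if the object behind the weak reference has been freed."""
    gc.collect()
    return ref() is None
```

With `leak=False` the loader is freed once `run_job` returns. With
`leak=True` it stays alive until the cache entry is removed, which
corresponds to the metaspace growth seen on each submission.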

Regarding `ProcessManager`, I have created a JIRA, BEAM-9006[2], to track
it. Regarding `ThreadPoolCache`, it is a Netty problem that has been fixed
in NETTY#8955[3]. Netty 4.1.35.Final already includes this fix, and gRPC
1.22.0 already depends on Netty 4.1.35.Final. So we need to bump the gRPC
version to 1.22.0+ (currently 1.21.0).
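
As a rough sketch of the "1.22.0+" requirement, the check below compares
version strings numerically rather than as text (a plain string comparison
would rank "1.9.0" above "1.22.0"). `parse_version` and `includes_netty_fix`
are hypothetical helper names, and the sketch assumes plain dotted numeric
versions with no suffix.

```python
def parse_version(text):
    """Turn a dotted version such as '1.22.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def includes_netty_fix(grpc_version, minimum="1.22.0"):
    """Return True if grpc_version is at least the minimum version.

    The default minimum of 1.22.0 is taken from the statement above that
    this release depends on Netty 4.1.35.Final.
    """
    return parse_version(grpc_version) >= parse_version(minimum)
```

For example, `includes_netty_fix("1.21.0")` is false for the current
version, while `includes_netty_fix("1.26.0")` is true.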

My proposal is to upgrade gRPC to 1.22.0+ (maybe the latest version,
1.26.0?).

I've created a JIRA [4], but I'm not sure whether bumping the gRPC version
will cause any other problems. So I'd like to start this discussion, and I
welcome your feedback!

[1] https://issues.apache.org/jira/browse/FLINK-15338
[2] https://issues.apache.org/jira/browse/BEAM-9006
[3] https://github.com/netty/netty/pull/8955
[4] https://issues.apache.org/jira/browse/BEAM-9030

Best,
Jincheng