Steven Zhen Wu created FLINK-18350:
--------------------------------------
Summary: [1.11.0] jobmanager complains `taskmanager.memory.process.size` missing
Key: FLINK-18350
URL: https://issues.apache.org/jira/browse/FLINK-18350
Project: Flink
Issue Type: Bug
Components: Runtime / Configuration
Affects Versions: 1.11.0
Reporter: Steven Zhen Wu
I saw this failure during jobmanager startup. I know the exception says that
`taskmanager.memory.process.size` is missing. We set it on the taskmanager side
in `flink-conf.yaml`, but I am wondering why the jobmanager requires it in
session cluster mode. When a taskmanager registers with the jobmanager, it
reports its resources (CPU, memory, etc.).
{code:java}
2020-06-17 18:06:25,079 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [main] - Could not start cluster entrypoint TitusSessionClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint TitusSessionClusterEntrypoint.
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:516)
    at com.netflix.spaas.runtime.TitusSessionClusterEntrypoint.main(TitusSessionClusterEntrypoint.java:103)
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
    at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:255)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
    at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168)
    ... 2 more
Caused by: org.apache.flink.configuration.IllegalConfigurationException: Cannot read memory size from config option 'taskmanager.memory.process.size'.
    at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:234)
    at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.deriveProcessSpecWithTotalProcessMemory(ProcessMemoryUtils.java:100)
    at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.memoryProcessSpecFromConfig(ProcessMemoryUtils.java:79)
    at org.apache.flink.runtime.clusterframework.TaskExecutorProcessUtils.processSpecFromConfig(TaskExecutorProcessUtils.java:109)
    at org.apache.flink.runtime.clusterframework.TaskExecutorProcessSpecBuilder.build(TaskExecutorProcessSpecBuilder.java:58)
    at org.apache.flink.runtime.resourcemanager.WorkerResourceSpecFactory.workerResourceSpecFromConfigAndCpu(WorkerResourceSpecFactory.java:37)
    at com.netflix.spaas.runtime.resourcemanager.TitusWorkerResourceSpecFactory.createDefaultWorkerResourceSpec(TitusWorkerResourceSpecFactory.java:17)
    at org.apache.flink.runtime.resourcemanager.ResourceManagerRuntimeServicesConfiguration.fromConfiguration(ResourceManagerRuntimeServicesConfiguration.java:67)
    at com.netflix.spaas.runtime.resourcemanager.TitusResourceManagerFactory.createResourceManager(TitusResourceManagerFactory.java:53)
    at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:167)
    ... 9 more
Caused by: java.lang.IllegalArgumentException: Could not parse value '7500}' for key 'taskmanager.memory.process.size'.
    at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:753)
    at org.apache.flink.configuration.Configuration.get(Configuration.java:738)
    at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:232)
    ... 18 more
Caused by: java.lang.IllegalArgumentException: Memory size unit '}' does not match any of the recognized units: (b | bytes) / (k | kb | kibibytes) / (m | mb | mebibytes) / (g | gb | gibibytes) / (t | tb | tebibytes)
    at org.apache.flink.configuration.MemorySize.parseUnit(MemorySize.java:331)
    at org.apache.flink.configuration.MemorySize.parseBytes(MemorySize.java:306)
    at org.apache.flink.configuration.MemorySize.parse(MemorySize.java:247)
    at org.apache.flink.configuration.Configuration.convertToMemorySize(Configuration.java:951)
    at org.apache.flink.configuration.Configuration.convertValue(Configuration.java:885)
    at org.apache.flink.configuration.Configuration.lambda$getOptional$2(Configuration.java:750)
    at java.util.Optional.map(Optional.java:215)
    at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:750)
    ... 20 more
{code}
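For reference, the innermost cause in the trace reports that the value `7500}` could not be parsed because `}` is not a recognized memory unit. The following is only a rough, standalone sketch of that kind of "number followed by unit" parsing, using the unit names listed in the error message. It is not Flink's `MemorySize` implementation; the class name `MemorySizeSketch` and its methods are hypothetical, and reading a bare number as bytes is an assumption of the sketch.

```java
import java.util.Locale;

// Hypothetical illustration only; this is NOT Flink's MemorySize class.
final class MemorySizeSketch {

    private MemorySizeSketch() {}

    /** Parses text such as "7500m" or "2 gb" into a byte count. */
    static long parseBytes(String text) {
        String trimmed = text.trim();

        // Split the leading digits from whatever follows them.
        int pos = 0;
        while (pos < trimmed.length() && Character.isDigit(trimmed.charAt(pos))) {
            pos++;
        }
        String number = trimmed.substring(0, pos);
        String unit = trimmed.substring(pos).trim().toLowerCase(Locale.US);

        if (number.isEmpty()) {
            throw new IllegalArgumentException(
                    "Text does not start with a number: '" + text + "'");
        }
        return Math.multiplyExact(Long.parseLong(number), multiplierFor(unit));
    }

    /** Maps a unit name to its multiplier; unknown units are rejected. */
    private static long multiplierFor(String unit) {
        switch (unit) {
            case "": // assumption of this sketch: a bare number means bytes
            case "b":
            case "bytes":
                return 1L;
            case "k":
            case "kb":
            case "kibibytes":
                return 1L << 10;
            case "m":
            case "mb":
            case "mebibytes":
                return 1L << 20;
            case "g":
            case "gb":
            case "gibibytes":
                return 1L << 30;
            case "t":
            case "tb":
            case "tebibytes":
                return 1L << 40;
            default:
                throw new IllegalArgumentException(
                        "Memory size unit '" + unit + "' is not recognized");
        }
    }
}
```

With this sketch, `MemorySizeSketch.parseBytes("7500m")` succeeds, while `MemorySizeSketch.parseBytes("7500}")` throws an `IllegalArgumentException`, since everything after the digits is treated as the unit.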
We extend WorkerResourceSpecFactory, similar to
KubernetesWorkerResourceSpecFactory.
{code:java}
public class TitusWorkerResourceSpecFactory extends WorkerResourceSpecFactory {

    public static final TitusWorkerResourceSpecFactory INSTANCE =
            new TitusWorkerResourceSpecFactory();

    @Override
    public WorkerResourceSpec createDefaultWorkerResourceSpec(Configuration configuration) {
        return workerResourceSpecFromConfigAndCpu(configuration, getDefaultCpus(configuration));
    }

    @VisibleForTesting
    static CPUResource getDefaultCpus(Configuration configuration) {
        double fallback = Double.valueOf(System.getenv("TITUS_NUM_CPU"));
        return TaskExecutorProcessUtils.getCpuCoresWithFallback(configuration, fallback);
    }
}
{code}