Steven Zhen Wu created FLINK-18350:
--------------------------------------

             Summary: [1.11.0] jobmanager complains `taskmanager.memory.process.size` missing
                 Key: FLINK-18350
                 URL: https://issues.apache.org/jira/browse/FLINK-18350
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Configuration
    Affects Versions: 1.11.0
            Reporter: Steven Zhen Wu


 

We saw this failure during jobmanager startup. I know the exception complains about `taskmanager.memory.process.size`, and we do set it on the taskmanager side in `flink-conf.yaml`. But I am wondering why the jobmanager requires it in session cluster mode: when a taskmanager registers with the jobmanager, it already reports its resources (CPU, memory, etc.).
{code:java}
2020-06-17 18:06:25,079 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [main]  - Could not start cluster entrypoint TitusSessionClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint TitusSessionClusterEntrypoint.
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:516)
        at com.netflix.spaas.runtime.TitusSessionClusterEntrypoint.main(TitusSessionClusterEntrypoint.java:103)
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
        at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:255)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
        at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168)
        ... 2 more
Caused by: org.apache.flink.configuration.IllegalConfigurationException: Cannot read memory size from config option 'taskmanager.memory.process.size'.
        at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:234)
        at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.deriveProcessSpecWithTotalProcessMemory(ProcessMemoryUtils.java:100)
        at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.memoryProcessSpecFromConfig(ProcessMemoryUtils.java:79)
        at org.apache.flink.runtime.clusterframework.TaskExecutorProcessUtils.processSpecFromConfig(TaskExecutorProcessUtils.java:109)
        at org.apache.flink.runtime.clusterframework.TaskExecutorProcessSpecBuilder.build(TaskExecutorProcessSpecBuilder.java:58)
        at org.apache.flink.runtime.resourcemanager.WorkerResourceSpecFactory.workerResourceSpecFromConfigAndCpu(WorkerResourceSpecFactory.java:37)
        at com.netflix.spaas.runtime.resourcemanager.TitusWorkerResourceSpecFactory.createDefaultWorkerResourceSpec(TitusWorkerResourceSpecFactory.java:17)
        at org.apache.flink.runtime.resourcemanager.ResourceManagerRuntimeServicesConfiguration.fromConfiguration(ResourceManagerRuntimeServicesConfiguration.java:67)
        at com.netflix.spaas.runtime.resourcemanager.TitusResourceManagerFactory.createResourceManager(TitusResourceManagerFactory.java:53)
        at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:167)
        ... 9 more
Caused by: java.lang.IllegalArgumentException: Could not parse value '7500}' for key 'taskmanager.memory.process.size'.
        at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:753)
        at org.apache.flink.configuration.Configuration.get(Configuration.java:738)
        at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:232)
        ... 18 more
Caused by: java.lang.IllegalArgumentException: Memory size unit '}' does not match any of the recognized units: (b | bytes) / (k | kb | kibibytes) / (m | mb | mebibytes) / (g | gb | gibibytes) / (t | tb | tebibytes)
        at org.apache.flink.configuration.MemorySize.parseUnit(MemorySize.java:331)
        at org.apache.flink.configuration.MemorySize.parseBytes(MemorySize.java:306)
        at org.apache.flink.configuration.MemorySize.parse(MemorySize.java:247)
        at org.apache.flink.configuration.Configuration.convertToMemorySize(Configuration.java:951)
        at org.apache.flink.configuration.Configuration.convertValue(Configuration.java:885)
        at org.apache.flink.configuration.Configuration.lambda$getOptional$2(Configuration.java:750)
        at java.util.Optional.map(Optional.java:215)
        at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:750)
        ... 20 more
{code}
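The innermost cause shows that the value the jobmanager read is `7500}`, and the trailing `}` is rejected because it is not one of the listed units. For sanity-checking a templated `flink-conf.yaml` value before deployment, a minimal stdlib-only sketch of that rule could look like the following. It is not Flink's `MemorySize` implementation: the class `MemorySizeCheck` is hypothetical, the unit list is copied from the exception message, binary (1024-based) multipliers are assumed from the kibi/mebi/gibi unit names, and a bare number is assumed to mean bytes.

```java
import java.util.Locale;

/**
 * Hypothetical, stdlib-only approximation of the memory-size check described by the
 * exception above. NOT Flink's MemorySize class; it only mirrors the quoted unit list.
 */
final class MemorySizeCheck {

    /** Parses "<digits>[whitespace]<unit>" into bytes; assumption: no unit means bytes. */
    static long parseBytes(String text) {
        String trimmed = text.trim();
        int pos = 0;
        while (pos < trimmed.length() && Character.isDigit(trimmed.charAt(pos))) {
            pos++;
        }
        if (pos == 0) {
            throw new IllegalArgumentException("Value must start with a number: '" + text + "'");
        }
        long number = Long.parseLong(trimmed.substring(0, pos));
        // Everything after the digits is treated as the unit, e.g. "m" in "7500m" or "}" in "7500}".
        String unit = trimmed.substring(pos).trim().toLowerCase(Locale.ROOT);
        return Math.multiplyExact(number, multiplierFor(unit));
    }

    private static long multiplierFor(String unit) {
        switch (unit) {
            case "":                                  // assumption: bare number = bytes
            case "b": case "bytes":                   return 1L;
            case "k": case "kb": case "kibibytes":   return 1L << 10;
            case "m": case "mb": case "mebibytes":   return 1L << 20;
            case "g": case "gb": case "gibibytes":   return 1L << 30;
            case "t": case "tb": case "tebibytes":   return 1L << 40;
            default:
                throw new IllegalArgumentException("Memory size unit '" + unit
                        + "' does not match any of the recognized units");
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBytes("7500m")); // prints 7864320000
        try {
            parseBytes("7500}");
        } catch (IllegalArgumentException e) {
            // prints: Memory size unit '}' does not match any of the recognized units
            System.out.println(e.getMessage());
        }
    }
}
```

Under the bytes-by-default assumption, a bare `7500` would be read as 7500 bytes rather than megabytes, so spelling out the unit (for example `7500m`) avoids that ambiguity.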
We extend `WorkerResourceSpecFactory`, similar to `KubernetesWorkerResourceSpecFactory`:
{code:java}
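package com.netflix.spaas.runtime.resourcemanager;

import org.apache.flink.annotation.VisibleForTesting;
import org.apache.flink.api.common.resources.CPUResource;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.clusterframework.TaskExecutorProcessUtils;
import org.apache.flink.runtime.resourcemanager.WorkerResourceSpec;
import org.apache.flink.runtime.resourcemanager.WorkerResourceSpecFactory;

/** Builds the default worker spec for Titus, mirroring KubernetesWorkerResourceSpecFactory. */
public class TitusWorkerResourceSpecFactory extends WorkerResourceSpecFactory {

  public static final TitusWorkerResourceSpecFactory INSTANCE =
      new TitusWorkerResourceSpecFactory();

  @Override
  public WorkerResourceSpec createDefaultWorkerResourceSpec(Configuration configuration) {
    // Derives the TaskExecutor process spec (including memory) from this configuration;
    // this is where the 'taskmanager.memory.process.size' lookup in the trace happens.
    return workerResourceSpecFromConfigAndCpu(configuration, getDefaultCpus(configuration));
  }

  @VisibleForTesting
  static CPUResource getDefaultCpus(Configuration configuration) {
    // Fallback CPU count from the Titus environment; Double.valueOf throws if the variable is unset.
    double fallback = Double.valueOf(System.getenv("TITUS_NUM_CPU"));
    return TaskExecutorProcessUtils.getCpuCoresWithFallback(configuration, fallback);
  }
}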
{code}
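In `getDefaultCpus` above, `Double.valueOf(System.getenv("TITUS_NUM_CPU"))` throws a `NullPointerException` when the variable is unset and a `NumberFormatException` when it is not numeric, either of which would surface as another opaque startup failure. A hedged sketch of a more defensive parse follows; `TitusCpuFallback`, `parseCpus`, and `requireCpusFromEnv` are hypothetical names, and only the `TITUS_NUM_CPU` variable comes from the snippet. Taking the raw string as a parameter keeps the parsing testable without touching the process environment.

```java
import java.util.OptionalDouble;

/**
 * Hypothetical helper (not part of Flink or the Titus integration above) that parses the
 * TITUS_NUM_CPU fallback defensively instead of calling Double.valueOf on a possibly null value.
 */
final class TitusCpuFallback {

    // Environment variable name taken from the getDefaultCpus snippet above.
    static final String ENV_VAR = "TITUS_NUM_CPU";

    /** Returns the CPU count, or empty if missing, blank, non-numeric, or not a positive finite number. */
    static OptionalDouble parseCpus(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            double cpus = Double.parseDouble(raw.trim());
            return Double.isFinite(cpus) && cpus > 0 ? OptionalDouble.of(cpus) : OptionalDouble.empty();
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    /** Reads the variable and fails with a descriptive message rather than a bare NullPointerException. */
    static double requireCpusFromEnv() {
        String raw = System.getenv(ENV_VAR);
        return parseCpus(raw).orElseThrow(() -> new IllegalStateException(
                ENV_VAR + " must be set to a positive number, but was: " + raw));
    }
}
```

In the factory, `double fallback = TitusCpuFallback.requireCpusFromEnv();` could then replace the `Double.valueOf` line; whether to fail fast like this or use a fixed default instead is a deployment decision.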
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
