zentol commented on a change in pull request #18678:
URL: https://github.com/apache/flink/pull/18678#discussion_r803465610



##########
File path: flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java
##########
@@ -273,6 +274,8 @@ public void setupYarnClient() {
         }
 
         flinkConfiguration = new org.apache.flink.configuration.Configuration(globalConfiguration);
+        flinkConfiguration.setString(RestOptions.ADDRESS.key(), "0.0.0.0");
+        flinkConfiguration.setString(RestOptions.BIND_ADDRESS.key(), "0.0.0.0");

Review comment:
       I think we should look into this further.
   It is not clear to me how changing the bind address fixes this. The TM can 
clearly register at the JM.
   
   The only trace I found in the logs is this sequence:
   
   ```
   INFO  o.a.f.yarn.YarnResourceManagerDriver              [] - TaskExecutor container_1644315356936_0001_01_000002(083c75858fb9:33809) will be started on 083c75858fb9 with TaskExecutorProcessSpec {cpuCores=2.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemorySize=230.400mb (241591914 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=192.000mb (201326592 bytes), numSlots=2}.
   INFO  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker container_1644315356936_0001_01_000002(083c75858fb9:33809) with resource spec WorkerResourceSpec {cpuCores=2.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=2}.
   INFO  org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Processing Event EventType: START_CONTAINER for Container container_1644315356936_0001_01_000002
   INFO  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Registering TaskManager with ResourceID container_1644315356936_0001_01_000002(083c75858fb9:33809) (akka.tcp://flink@083c75858fb9:33093/user/rpc/taskmanager_0) at ResourceManager
   INFO  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_1644315356936_0001_01_000002(083c75858fb9:33809) is registered.
   INFO  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_1644315356936_0001_01_000002(083c75858fb9:33809) with resource spec WorkerResourceSpec {cpuCores=2.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=2} was requested in current attempt. Current pending count after registering: 0.
   INFO  o.a.f.runtime.dispatcher.MiniDispatcher           [] - Job f0690a993ed145c6ebb640d0682b2885 reached terminal state FINISHED.
   INFO  o.a.f.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job 'Flink Java Job at Tue Feb 08 10:15:59 UTC 2022' (f0690a993ed145c6ebb640d0682b2885).
   INFO  o.a.f.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [064b6440e89800203d7fc93383869f26].
   INFO  o.a.f.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection 378fc312596e57c7dcdd9cbee98ec674: Stopping JobMaster for job 'Flink Java Job at Tue Feb 08 10:15:59 UTC 2022' (f0690a993ed145c6ebb640d0682b2885).
   INFO  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 00000000000000000000000000000...@akka.tcp://flink@083c75858fb9:42211/user/rpc/jobmanager_1 for job f0690a993ed145c6ebb640d0682b2885 from the resource manager.
   INFO  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Stopping worker container_1644315356936_0001_01_000002(083c75858fb9:33809).
   INFO  o.a.f.yarn.YarnResourceManagerDriver              [] - Stopping container container_1644315356936_0001_01_000002(083c75858fb9:33809).
   INFO  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Closing TaskExecutor connection container_1644315356936_0001_01_000002(083c75858fb9:33809) because: TaskExecutor exceeded the idle timeout.
   INFO  org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Processing Event EventType: STOP_CONTAINER for Container container_1644315356936_0001_01_000002
   2022-02-08 10:17:06,411 WARN  o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Discard registration from TaskExecutor container_1644315356936_0001_01_000002(083c75858fb9:33809) at (akka.tcp://flink@083c75858fb9:33093/user/rpc/taskmanager_0) because the framework did not recognize it
   ```
   
   This looks like a TM trying to re-register with the JM after a shutdown was requested.
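
To make the suspected sequence concrete, here is a minimal, self-contained sketch of the bookkeeping I have in mind. It is not Flink's actual code: `WorkerTracker`, `requestWorker`, `stopWorker` and `register` are hypothetical names that only model the idea that a registration is accepted while the container is still tracked and discarded once the worker has been stopped.

```java
import java.util.HashSet;
import java.util.Set;

public class RegistrationSketch {

    enum Result { REGISTERED, DISCARDED }

    /** Hypothetical stand-in for the resource manager's view of its workers. */
    static class WorkerTracker {
        // Containers the framework currently recognizes.
        private final Set<String> recognized = new HashSet<>();

        // Called when a container is requested and started.
        void requestWorker(String containerId) {
            recognized.add(containerId);
        }

        // Called when the worker is stopped, e.g. after the idle timeout.
        void stopWorker(String containerId) {
            recognized.remove(containerId);
        }

        // A registration is only accepted for a container that is still tracked.
        Result register(String containerId) {
            return recognized.contains(containerId) ? Result.REGISTERED : Result.DISCARDED;
        }
    }

    public static void main(String[] args) {
        String container = "container_1644315356936_0001_01_000002";
        WorkerTracker tracker = new WorkerTracker();

        tracker.requestWorker(container);
        System.out.println(tracker.register(container)); // first registration is accepted

        tracker.stopWorker(container);
        System.out.println(tracker.register(container)); // late re-registration is discarded
    }
}
```

If this model is roughly right, the final WARN line is a consequence of the shutdown ordering rather than of the REST address settings.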



