[ https://issues.apache.org/jira/browse/FLINK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006647#comment-17006647 ]
Xintong Song commented on FLINK-15448:
--------------------------------------

I wonder how many cases there are where we want to log a TM's host information together with its ResourceID but the host information is not available. I believe that in most cases, if not all, this information can be obtained easily. Taking your example, you can get the host information in 'notifyHeartbeatTimeout' by simply calling 'registeredTaskManagers.get(resourceID).f0.getHostname()'.

It's true that initializing ResourceID with the host information on Yarn is an easier approach. But this approach also raises a backwards compatibility concern. Some users may already have built their own monitoring / analysis systems that depend on TM ResourceIDs in Flink logs / metrics matching the container IDs in Yarn logs / metrics. If we change the convention for generating TM ResourceIDs, these users will have to change their systems as well if they want to upgrade to new Flink versions. I'm not sure that the convenience of changing ResourceID, rather than providing host information at the logging sites, is worth breaking such backwards compatibility.

As additional information, logging the TM host information is not always helpful. In containerized scenarios such as K8s, where each container has its own IP address, the TM hostname does not reflect which machine the container was launched on. Even for Yarn, Hadoop 3.x already supports containerized applications, and there are already [discussions|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Building-with-Hadoop-3-td31395.html] in the community about Flink supporting Hadoop 3.x.
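To illustrate the alternative of resolving the host at the logging site rather than encoding it in the ResourceID, here is a minimal, self-contained sketch. It does not use Flink's actual classes: HeartbeatTimeoutLogging, TaskManagerInfo, register and describeTimeout are hypothetical stand-ins, and the registry is simplified to a plain map from the ID string to the host information (the real 'registeredTaskManagers' apparently holds a tuple per entry, hence the 'f0' in the call above).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the component that receives heartbeat timeouts.
// None of these names are Flink APIs; they only illustrate the idea.
class HeartbeatTimeoutLogging {

    // Hypothetical holder for what is known about a registered TaskManager.
    static final class TaskManagerInfo {
        private final String hostname;

        TaskManagerInfo(String hostname) {
            this.hostname = hostname;
        }

        String getHostname() {
            return hostname;
        }
    }

    // Simplified registry: resource ID string -> TaskManager information.
    private final Map<String, TaskManagerInfo> registeredTaskManagers =
            new ConcurrentHashMap<>();

    void register(String resourceId, String hostname) {
        registeredTaskManagers.put(resourceId, new TaskManagerInfo(hostname));
    }

    // Builds the timeout message, looking up the host only where it is logged.
    // The resource ID itself stays unchanged, so it still matches the Yarn container ID.
    String describeTimeout(String resourceId) {
        TaskManagerInfo info = registeredTaskManagers.get(resourceId);
        String host = (info != null) ? info.getHostname() : "unknown";
        return "The heartbeat of TaskManager with id " + resourceId
                + " (host: " + host + ") timed out.";
    }
}
```

With this shape, the ID printed in the message is still the bare container ID, and the host is appended only when the registry still knows the TaskManager. An entry that has already been removed falls back to "unknown" instead of failing.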
> Make "ResourceID#toString" more descriptive
> -------------------------------------------
>
>                 Key: FLINK-15448
>                 URL: https://issues.apache.org/jira/browse/FLINK-15448
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.9.1
>            Reporter: Victor Wong
>            Priority: Major
>
> With Flink on Yarn, sometimes we ran into an exception like this:
> {code:java}
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_xxxx timed out.
> {code}
> We'd like to find out the host of the lost TaskManager to log into it for
> more details, we have to check the previous logs for the host information,
> which is a little time-consuming.
> Maybe we can add more descriptive information to ResourceID of Yarn
> containers, e.g. "container_xxx@host_name:port_number".
> Here's the demo:
> {code:java}
> class ResourceID {
>   final String resourceId;
>   final String details;
>
>   public ResourceID(String resourceId) {
>     this.resourceId = resourceId;
>     this.details = resourceId;
>   }
>
>   public ResourceID(String resourceId, String details) {
>     this.resourceId = resourceId;
>     this.details = details;
>   }
>
>   public String toString() {
>     return details;
>   }
> }
>
> // in flink-yarn
> private void startTaskExecutorInContainer(Container container) {
>   final String containerIdStr = container.getId().toString();
>   final String containerDetail = container.getId() + "@" + container.getNodeId();
>   final ResourceID resourceId = new ResourceID(containerIdStr, containerDetail);
>   ...
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)