[ 
https://issues.apache.org/jira/browse/FLINK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006647#comment-17006647
 ] 

Xintong Song commented on FLINK-15448:
--------------------------------------

I wonder how many cases there are in which we want to log a TM's host 
information together with its ResourceID but the host information is not 
available. I believe that in most cases, if not all, such information can be 
obtained easily. Taking your example, you can get the host information in 
'notifyHeartbeatTimeout' by simply calling 
'registeredTaskManagers.get(resourceID).f0.getHostname()'.
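
To illustrate the lookup described above, here is a minimal, self-contained 
sketch. It does not use Flink's real classes: 'Location', 'Registration' and 
'HeartbeatHostLookup' are hypothetical stand-ins, the map is keyed by a plain 
String instead of ResourceID, and the message format with "(host: ...)" is only 
an assumption about how the log line could look.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the location object behind 'f0'; only the hostname is modelled.
final class Location {
    private final String hostname;

    Location(String hostname) {
        this.hostname = hostname;
    }

    String getHostname() {
        return hostname;
    }
}

// Hypothetical stand-in for the tuple stored in 'registeredTaskManagers'.
final class Registration {
    final Location f0;

    Registration(Location f0) {
        this.f0 = f0;
    }
}

final class HeartbeatHostLookup {
    // Keyed by the unchanged resource ID string (e.g. the Yarn container ID).
    private final Map<String, Registration> registeredTaskManagers = new HashMap<>();

    void register(String resourceId, String hostname) {
        registeredTaskManagers.put(resourceId, new Registration(new Location(hostname)));
    }

    // Builds the timeout message; the host is looked up at the logging place,
    // so the resource ID itself stays exactly as it was.
    String heartbeatTimeoutMessage(String resourceId) {
        Registration registration = registeredTaskManagers.get(resourceId);
        String host = registration == null ? "unknown" : registration.f0.getHostname();
        return "The heartbeat of TaskManager with id " + resourceId
                + " (host: " + host + ") timed out.";
    }

    public static void main(String[] args) {
        HeartbeatHostLookup lookup = new HeartbeatHostLookup();
        lookup.register("container_xxxx", "worker-1.example.com");
        System.out.println(lookup.heartbeatTimeoutMessage("container_xxxx"));
    }
}
```

The point of the sketch is that the ID printed in the message is still the 
plain container ID, so anything matching it against Yarn logs keeps working.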

It's true that initializing ResourceID with the host information on Yarn is an 
easier approach. But this approach also raises a backwards compatibility 
concern. There might be users who have already built their own monitoring / 
analysis systems, which may depend on TM ResourceIDs in Flink logs / metrics 
matching the container IDs in Yarn logs / metrics. If we change the convention 
for generating TM ResourceIDs, these users will have to change their systems as 
well if they want to upgrade to new Flink versions. I'm not sure the 
convenience of changing ResourceID, rather than providing host information at 
the logging places, would be worth breaking such backwards compatibility.

As additional information: logging the TM host information is not always 
helpful. In containerized scenarios such as K8s, where each container has its 
own IP address, the TM hostname does not reflect which machine the container is 
launched on. Even for Yarn, Hadoop 3.x already supports containerized 
applications, and there are already 
[discussions|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Building-with-Hadoop-3-td31395.html]
 in the community regarding Flink support for Hadoop 3.x.

> Make "ResourceID#toString" more descriptive
> -------------------------------------------
>
>                 Key: FLINK-15448
>                 URL: https://issues.apache.org/jira/browse/FLINK-15448
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.9.1
>            Reporter: Victor Wong
>            Priority: Major
>
> With Flink on Yarn, sometimes we ran into an exception like this:
> {code:java}
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 
> container_xxxx  timed out.
> {code}
> We'd like to find out the host of the lost TaskManager so that we can log 
> into it for more details, but we have to check the previous logs for the 
> host information, which is a little time-consuming.
> Maybe we can add more descriptive information to ResourceID of Yarn 
> containers, e.g. "container_xxx@host_name:port_number".
> Here's the demo:
> {code:java}
> // Sketch: a ResourceID that keeps the original ID and adds a display string.
> class ResourceID {
>   final String resourceId;
>   // Human-readable form returned by toString(); defaults to the ID itself.
>   final String details;
>   public ResourceID(String resourceId) {
>     this(resourceId, resourceId);
>   }
>   public ResourceID(String resourceId, String details) {
>     this.resourceId = resourceId;
>     this.details = details;
>   }
>   @Override
>   public String toString() {
>     return details;
>   }
> }
> // in flink-yarn
> private void startTaskExecutorInContainer(Container container) {
>   final String containerIdStr = container.getId().toString();
>   // Produces the "container_xxx@host_name:port_number" form described above.
>   final String containerDetail = containerIdStr + "@" + container.getNodeId();
>   final ResourceID resourceId = new ResourceID(containerIdStr, containerDetail);
>   ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
