[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485527#comment-14485527 ]
Zhijie Shen commented on YARN-3044:
-----------------------------------

Before screening the patch details, I have some high-level comments:

bq. IIUC you meant we will have RMContainerEntity having type as YARN_RM_CONTAINER and NMContainerEntity having type as YARN_NM_CONTAINER right ?

Can we just use ContainerEntity? The events from RM are RM_XXXX_EVENT, and those from NM are NM_XXXX_EVENT.

bq. I'm very much concerned about the volume of writes that the RM collector would need to do,

bq. I fully understand the concern from Sangjin Lee that RM may not afford tens of thousands containers in large size cluster.

I also think publishing all container lifecycle events from NM is likely to be a big cost in total, but I'd like to offer another point of view. Say we have a big cluster that can afford 5,000 concurrent containers. The RM has to maintain the lifecycle of these 5K containers, and I don't think a less powerful server can manage that, right? Assuming we have such a powerful server to run the RM of a big cluster, will publishing lifecycle events be a big deal for that server? I'm not sure, but I can provide some hints.

Right now each container writes 2 events per lifecycle, and in the future we may want to record each state transition, resulting in ~10 events per lifecycle. Therefore, we have 10 * 5K lifecycle events, and they won't be written at the same moment because containers' lifecycles are usually async. If we assume each container runs for 1h and lifecycle events are uniformly distributed, that is 50,000 events over 3,600 seconds, or only around 14 writes per second (on a powerful server). I think we may be overestimating the performance impact of writing NM lifecycles.

Perhaps a more reasonable performance metric is {{cost of writing lifecycle events per container / cost of managing lifecycle per container * 100%}}. For example, if it is 2%, I guess it will probably be acceptable.

bq. all configs will not be set as part of this so was there more planned for this from the framework side or each application needs to take care of this on their own to populate configuration information ?

bq. In that sense, how about letting frameworks (namely AMs) write the configuration instead of RM?

I'm not sure I understand this part correctly, but I'm inclined to think that system timeline data (RM/NM) is controlled by cluster config and is per cluster, while application data is controlled by framework or even per-application config. It may cause problems if users are able to change the former config. For example, a user could hide their application's information from the cluster admin.

bq. I have also incorporated the changes to support RMContainer metrics based on configuration (Junping's comments).

Do you mean we should keep {{yarn.resourcemanager.system-metrics-publisher.enabled}} to control RM SMP, and create {{yarn.nodemanager.system-metrics-publisher.enabled}} to control NM SMP?

> [Event producers] Implement RM writing app lifecycle events to ATS
> ------------------------------------------------------------------
>
>                 Key: YARN-3044
>                 URL: https://issues.apache.org/jira/browse/YARN-3044
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Naganarasimha G R
>         Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch
>
>
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
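
As a rough check of the write-rate estimate in the comment above, here is a minimal Python sketch. The function names are hypothetical and are not part of YARN. The inputs (5,000 concurrent containers, ~10 events per lifecycle, a 1h container runtime, uniformly distributed events) are the comment's own assumptions, not measurements.

```python
def writes_per_second(concurrent_containers, events_per_lifecycle, container_runtime_s):
    """Average event writes per second in steady state.

    Assumes (as the comment does) that every container lives for
    container_runtime_s seconds and that its lifecycle events are spread
    uniformly over that time.
    """
    total_events = concurrent_containers * events_per_lifecycle
    return total_events / container_runtime_s


def relative_overhead_percent(write_cost_per_container, manage_cost_per_container):
    """The metric proposed in the comment:
    cost of writing lifecycle events / cost of managing lifecycle * 100%.
    Both costs must be in the same (arbitrary) unit.
    """
    return write_cost_per_container / manage_cost_per_container * 100.0


if __name__ == "__main__":
    rate = writes_per_second(5000, 10, 3600)
    print(f"{rate:.1f} writes/s")  # prints "13.9 writes/s"
```

With today's 2 events per lifecycle instead of 10, the same formula gives roughly 2.8 writes per second, so the estimate scales linearly with the number of recorded events.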