[ https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370898#comment-16370898 ]
Xuan Gong commented on YARN-7952:
---------------------------------

Right now, the NM periodically sends its own log aggregation status to the RM. The RM aggregates the status for each application, but it does not generate the final status until a client call (from the web UI or CLI) triggers it. The RM never persists the log aggregation status, so when the RM restarts or fails over, the status reverts to “NOT_STARTED”. This is confusing; maybe we should change it to “NOT_AVAILABLE” (I will create a separate ticket for this).

Anyway, we need to persist the log aggregation status for future use.

Option one: the centralized approach.
Create a new service called LogAggregationTrackingService in the RM which will track the log aggregation status for all applications. We can also introduce “EXPIRY_INTERVAL_MS”. The service can wake up periodically to check the log aggregation progress. This LogAggregationTrackingService will be similar to a LivenessMonitor (such as AMLivenessMonitor). After EXPIRY_INTERVAL_MS, the service would trigger an RMStateStore update event to persist the final log aggregation status. So we need to add one more RMStateStore event for every application. Also, if an RM restart or failover happens within the EXPIRY_INTERVAL_MS window, we still lose the log aggregation status.

Option two: only care about the log aggregation status for the latest applications.
This approach will not persist the log aggregation status, so we will not need to trigger a new RMStateStore event. When the NM sends the log aggregation status to the RM, it will save a copy in its own memory (do we need to persist it in the NM state store?). We also introduce “EXPIRY_INTERVAL_MS”. When the RM restarts or fails over, the NM re-registers with the RM. At that time, the NM would resend its saved copies of the log aggregation status to the RM, limited to those that fall within the configured “EXPIRY_INTERVAL_MS” (current_timestamp - last_updated_timestamp <= EXPIRY_INTERVAL_MS). So the RM could regenerate the log aggregation status.
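
To make the Option two filter concrete, here is a minimal sketch of what the NM-side copy might look like. Everything in it is an assumption for illustration: the class NmLogAggregationCache and its methods are hypothetical and do not exist in YARN, the status is a plain String standing in for whatever status type the NM actually reports, and timestamps are passed in explicitly so the expiry rule is easy to follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical NM-side cache of the last status sent to the RM (not a YARN class). */
final class NmLogAggregationCache {

    /** Last reported status for one application, with the time it was recorded. */
    private static final class Entry {
        final String status;
        final long lastUpdatedMs;

        Entry(String status, long lastUpdatedMs) {
            this.status = status;
            this.lastUpdatedMs = lastUpdatedMs;
        }
    }

    private final long expiryIntervalMs; // plays the role of EXPIRY_INTERVAL_MS
    private final Map<String, Entry> latest = new LinkedHashMap<>();

    NmLogAggregationCache(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    /** Keep an in-memory copy of the status the NM just sent to the RM. */
    void record(String appId, String status, long nowMs) {
        latest.put(appId, new Entry(status, nowMs));
    }

    /**
     * Statuses to replay when the NM re-registers after an RM restart or failover.
     * An entry is kept only if nowMs - lastUpdatedMs <= expiryIntervalMs;
     * older entries are evicted from the cache.
     */
    Map<String, String> reportsForReRegistration(long nowMs) {
        latest.entrySet().removeIf(
            e -> nowMs - e.getValue().lastUpdatedMs > expiryIntervalMs);
        Map<String, String> replay = new LinkedHashMap<>();
        for (Map.Entry<String, Entry> e : latest.entrySet()) {
            replay.put(e.getKey(), e.getValue().status);
        }
        return replay;
    }
}
```

Because the copy lives only in NM memory in this sketch, an NM restart would lose it, which is exactly the open question about the NM state store raised above.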
Most of the changes will happen on the NM side.

Option three: Option one + Option two.

> Find a way to persist the log aggregation status
> ------------------------------------------------
>
>                 Key: YARN-7952
>                 URL: https://issues.apache.org/jira/browse/YARN-7952
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Xuan Gong
>            Assignee: Xuan Gong
>            Priority: Major
>
> In MAPREDUCE-6415, we created a CLI to har the aggregated logs, and in
> YARN-4946 (RM should write out Aggregated Log Completion file flag next to
> logs) we discussed how we can get the log aggregation status: make
> a client call to the RM or get it directly from the distributed file system (HDFS).
> Whichever approach we choose, we first need to figure out a way
> to persist the log aggregation status. This ticket is used to track the
> progress of that work.