[ 
https://issues.apache.org/jira/browse/YARN-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794733#comment-13794733
 ] 

Zhijie Shen commented on YARN-975:
----------------------------------

The namespace concern is that, when we run on top of HDFS, too many files per 
application may overwhelm the name node. To reduce the number of files, we put 
all the history data of an application, its application attempts, and its 
containers into one TFile. Each TFile will then contain:

||key||value||
|ApplicationId|ApplicationHistoryData|
|ApplicationAttemptId1|ApplicationAttemptHistoryData1|
|ApplicationAttemptId2|ApplicationAttemptHistoryData2|
|ContainerId1|ContainerHistoryData1|
|ContainerId2|ContainerHistoryData2|
|ContainerId3|ContainerHistoryData3|

The benefit is that we strictly limit the number of files per application to 1. 
However, even if we only read part of an application's history data, for 
example the application information, we still need to load the complete file. 
Hopefully, the meta information of an application will not be big, and loading 
it will not badly affect I/O performance.
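
To make the trade-off concrete, here is a minimal sketch of the 
one-file-per-application layout. It does not use the TFile API: the class 
SingleFileHistorySketch, its methods, the sample ids, and the string values are 
all hypothetical stand-ins, and the "file" is just an in-memory byte array of 
key/value records. It only illustrates that a lookup of a single record has to 
go through the one shared file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the proposed layout; this is NOT the TFile API. */
class SingleFileHistorySketch {

  /** Writes every (key, value) record of one application into a single "file". */
  static byte[] writeAll(Map<String, String> records) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (Map.Entry<String, String> e : records.entrySet()) {
        out.writeUTF(e.getKey());    // e.g. application / attempt / container id
        out.writeUTF(e.getValue());  // e.g. serialized history data
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Scans the file from the start until the key is found; null if absent. */
  static String read(byte[] file, String wantedKey) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
      while (true) {
        String key;
        try {
          key = in.readUTF();
        } catch (EOFException endOfFile) {
          return null;               // reached the end without finding the key
        }
        String value = in.readUTF();
        if (key.equals(wantedKey)) {
          return value;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Sample ids and values are made up for illustration only.
    Map<String, String> records = new LinkedHashMap<>();
    records.put("application_1", "ApplicationHistoryData");
    records.put("appattempt_1_1", "ApplicationAttemptHistoryData1");
    records.put("container_1_1_1", "ContainerHistoryData1");

    byte[] file = writeAll(records);
    System.out.println(read(file, "application_1"));
  }
}
```

In this sketch the cost of a partial read grows with the size of the whole 
file, which is why the size of the per-application meta information matters.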

In addition, we can add an application-level cache to avoid accessing the 
secondary storage system all the time. However, I propose that it be done 
separately.
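
As a rough illustration of what an application-level cache could look like, 
here is a small LRU sketch built on java.util.LinkedHashMap. The class name 
ApplicationHistoryCacheSketch, the loader function, and the capacity parameter 
are assumptions made for this example; none of them come from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache keyed by application id; not part of the patch. */
class ApplicationHistoryCacheSketch<V> {
  private final Map<String, V> cache;
  private final Function<String, V> loader;

  /**
   * @param maxApplications maximum number of applications kept in memory
   * @param loader          reads one application's history from storage
   */
  ApplicationHistoryCacheSketch(final int maxApplications, Function<String, V> loader) {
    this.loader = loader;
    // accessOrder = true makes iteration order least-recently-used first.
    this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxApplications;  // evict once over capacity
      }
    };
  }

  /** Returns the cached history, loading it from storage on a miss. */
  synchronized V get(String applicationId) {
    V history = cache.get(applicationId);
    if (history == null) {
      history = loader.apply(applicationId);
      cache.put(applicationId, history);
    }
    return history;
  }

  /** Number of applications currently cached. */
  synchronized int size() {
    return cache.size();
  }
}
```

With such a cache, repeated reads of the same application would only hit the 
file system on the first access or after eviction.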

Thoughts?

> Add a file-system implementation for history-storage
> ----------------------------------------------------
>
>                 Key: YARN-975
>                 URL: https://issues.apache.org/jira/browse/YARN-975
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: YARN-975.1.patch, YARN-975.2.patch, YARN-975.3.patch, 
> YARN-975.4.patch, YARN-975.5.patch
>
>
> HDFS implementation should be a standard persistence strategy of history 
> storage



--
This message was sent by Atlassian JIRA
(v6.1#6144)
