[ 
https://issues.apache.org/jira/browse/YARN-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831847#comment-13831847
 ] 

ledion bitincka commented on YARN-1440:
---------------------------------------

[~zjffdu] - I stand corrected: there is currently one TFile per application per 
node, with container_id used as the key. While this is better than creating one 
file per container, it still leaves the cluster exposed to the small file 
problem. Imagine a 1000 node cluster running 10000 apps/day: in the worst case, 
with every app running containers on every node, this would lead to 10M new 
TFiles per day. My hope is to reduce complexity at the log file level while 
punting the small file problem to the FS layer. The reasoning is that not all 
filesystems which can be used with Hadoop have a small file problem!
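
As a rough back-of-the-envelope sketch of that estimate: with one TFile per 
application per node, the daily file count is apps/day times the number of 
nodes each app touches. The function name and the 50-node figure below are 
illustrative assumptions, not measurements from a real cluster.

```python
def tfiles_per_day(apps_per_day: int, nodes_per_app: int) -> int:
    """Estimate new aggregated TFiles per day.

    Assumes one TFile per (application, node) pair, as described above.
    nodes_per_app is the number of nodes on which an app runs containers.
    """
    return apps_per_day * nodes_per_app


# Worst case from the comment: every app runs on all 1000 nodes.
worst_case = tfiles_per_day(apps_per_day=10_000, nodes_per_app=1_000)

# Hypothetical milder case: each app touches only 50 nodes on average.
milder_case = tfiles_per_day(apps_per_day=10_000, nodes_per_app=50)

print(worst_case)   # 10000000
print(milder_case)  # 500000
```

Even the milder hypothetical case adds hundreds of thousands of files per 
day, which is why the small file problem does not go away.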

> Yarn aggregated logs are difficult for external tools to understand
> -------------------------------------------------------------------
>
>                 Key: YARN-1440
>                 URL: https://issues.apache.org/jira/browse/YARN-1440
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: ledion bitincka
>              Labels: log-aggregation, logs, tfile, yarn
>
> The log aggregation feature in Yarn is awesome! However, the file type and 
> format into which the log files are aggregated (TFile) should either be 
> much simpler or be made pluggable. The current TFile format forces anyone who 
> wants to see the files to either 
> a) use the web UI
> b) use the CLI tools (yarn logs)  or 
> c) write custom code to read the files 
> My suggestion would be to simplify the log collection by collecting and 
> writing the raw log files into a directory structure as follows: 
> {noformat}
> /{log-collection-dir}/{app-id}/{container-id}/{log-file-name} 
> {noformat}
> This way the application developers can (re)use a much wider array of tools 
> to process the logs. 
> For readers who are not familiar with the logs and their format, you can find 
> more info in the following two blog posts:
> http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
> http://blogs.splunk.com/2013/11/18/hadoop-2-0-rant/
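
To illustrate the directory layout proposed in the description above, here is 
a minimal sketch of how an external tool might build and parse such paths. 
The helper names and the example IDs are hypothetical; nothing here is an 
existing YARN API.

```python
from pathlib import PurePosixPath


def raw_log_path(log_collection_dir, app_id, container_id, log_file_name):
    """Build /{log-collection-dir}/{app-id}/{container-id}/{log-file-name}."""
    return PurePosixPath(log_collection_dir) / app_id / container_id / log_file_name


def parse_raw_log_path(path, log_collection_dir):
    """Split a raw log path back into (app_id, container_id, log_file_name).

    Raises ValueError if the path is not under log_collection_dir or does
    not have exactly three components below it.
    """
    parts = PurePosixPath(path).relative_to(log_collection_dir).parts
    if len(parts) != 3:
        raise ValueError(f"unexpected layout: {path}")
    app_id, container_id, log_file_name = parts
    return app_id, container_id, log_file_name


# Made-up IDs, for illustration only.
example = raw_log_path(
    "/logs", "application_1_0001", "container_1_0001_01_000001", "stderr"
)
print(example)
print(parse_raw_log_path(example, "/logs"))
```

With a layout like this, ordinary tools that understand plain files and 
directories can locate a container's logs without TFile-specific code.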



--
This message was sent by Atlassian JIRA
(v6.1#6144)
