[ 
https://issues.apache.org/jira/browse/YARN-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832115#comment-13832115
 ] 

Vinod Kumar Vavilapalli commented on YARN-1440:
-----------------------------------------------

bq. That being said, I totally agree the TFile format for aggregated logs is 
not very fun to wield as a user. I don't know the thought process that went 
into choosing it, but I suspect it was a straightforward way to aggregate all 
of an app's logfiles on a node into a single file in HDFS.
The original reason I picked TFile was programmatic access for users. With 
logs there are conflicting use cases: on one hand users would like them to be 
human readable, and on the other hand people want to write tools. So I picked 
TFile for machine readability, together with a log dumper to facilitate human 
readability.

bq. Maybe one way to get the benefit of both easy-to-access logs and less 
namespace pressure is to go ahead and aggregate them as separate files but have 
a periodic process to archive logs in a har to reduce the namespace. That 
wouldn't address the significant additional write load this approach would 
place on the namenode, however.
bq. My hope is reduced complexity at the log file level while punting the small 
file problem to the FS layer - the reasoning here being that not all 
filesystems which can be used on Hadoop have a small file problem!
Yes, because of the latter issue (NameNode load), we should think before we make 
this leap. HDFS is the dominant FS that people use for YARN+MR jobs, and YARN 
needs to work well there.

bq. Would it be helpful for YARN to supply a public API that reads the files 
for you?
We already have this. See AggregatedLogFormat and LogCLIHelpers.

Once we have more power in HDFS, it is very likely that we'll change this to be 
a single file + directory structure.

We can definitely move things around so that this concept of a per-node, 
per-app file applies only to HDFS, and for some other implementation we can 
have a single file. I am +1 if that is the goal; we just need to find and put 
the appropriate abstractions in place.
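
As a rough sketch of what such an abstraction might look like: the snippet below models the storage layout as a pluggable interface with two implementations, one that maps every container log on a node to a single per-app, per-node file, and one that follows the raw-file layout proposed in the issue description. All class and method names here (LogLayout, PerNodeAggregateLayout, PerFileLayout, location) are hypothetical and are not existing YARN classes, and the per-node path shape is a simplifying assumption.

```python
from abc import ABC, abstractmethod
from pathlib import PurePosixPath


class LogLayout(ABC):
    """Hypothetical abstraction: maps one container log to its storage location."""

    def __init__(self, root: str) -> None:
        self.root = PurePosixPath(root)

    @abstractmethod
    def location(self, app_id: str, node_id: str,
                 container_id: str, log_name: str) -> PurePosixPath:
        """Return the path under which this log would be stored."""


class PerNodeAggregateLayout(LogLayout):
    """One aggregated file per (app, node): few namespace entries, needs a reader."""

    def location(self, app_id, node_id, container_id, log_name):
        # Assumed path shape: every container log on the node shares one file,
        # so container_id and log_name do not affect the location.
        return self.root / app_id / node_id


class PerFileLayout(LogLayout):
    """One raw file per log, following the layout suggested in the issue."""

    def location(self, app_id, node_id, container_id, log_name):
        # /{log-collection-dir}/{app-id}/{container-id}/{log-file-name}
        return self.root / app_id / container_id / log_name
```

With something along these lines, the HDFS-backed implementation could keep the aggregated file and its lower NameNode load, while a filesystem without a small-file problem could plug in the per-file variant.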

> Yarn aggregated logs are difficult for external tools to understand
> -------------------------------------------------------------------
>
>                 Key: YARN-1440
>                 URL: https://issues.apache.org/jira/browse/YARN-1440
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: ledion bitincka
>              Labels: log-aggregation, logs, tfile, yarn
>
> The log aggregation feature in Yarn is awesome! However, the file type and 
> format in which the log files are aggregated into (TFile) should either be 
> much simpler or be made pluggable. The current TFile format forces anyone who 
> wants to see the files to either 
> a) use the web UI
> b) use the CLI tools (yarn logs)  or 
> c) write custom code to read the files 
> My suggestion would be to simplify the log collection by collecting and 
> writing the raw log files into a directory structure as follows: 
> {noformat}
> /{log-collection-dir}/{app-id}/{container-id}/{log-file-name} 
> {noformat}
> This way the application developers can (re)use a much wider array of tools 
> to process the logs. 
> For readers who are not familiar with the logs and their format, more info 
> can be found in the following two blog posts:
> http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
> http://blogs.splunk.com/2013/11/18/hadoop-2-0-rant/



