[ https://issues.apache.org/jira/browse/YARN-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832115#comment-13832115 ]
Vinod Kumar Vavilapalli commented on YARN-1440:
-----------------------------------------------

bq. That being said, I totally agree the TFile format for aggregated logs is not very fun to wield as a user. I don't know the thought process that went into choosing it, but I suspect it was a straightforward way to aggregate all of an app's logfiles on a node into a single file in HDFS.

The original reason I picked TFile is programmatic access for users. With logs there are conflicting use cases - on one hand users would like them to be human readable, and on the other hand people want to write tools. So I picked TFile for machine readability, together with a log dumper to facilitate human readability.

bq. Maybe one way to get the benefit of both easy-to-access logs and less namespace pressure is to go ahead and aggregate them as separate files but have a periodic process to archive logs in a har to reduce the namespace. That wouldn't address the significant additional write load this approach would place on the namenode, however.

bq. My hope is reduced complexity at the log file level while punting the small file problem to the FS layer - the reasoning here being that not all filesystems which can be used on Hadoop have a small file problem!

Yes, because of the latter issue (NameNode load), we should think before we make this leap. HDFS is the dominant FS that people use for YARN+MR jobs, and YARN needs to work well there.

bq. Would it be helpful for YARN to supply a public API that reads the files for you?

We already have this. See AggregatedLogFormat and LogCLIHelpers.

Once we have more power in HDFS, it is very likely that we'll change this to be a single file + directory structure. We can definitely move things around so that this concept of a per-node, per-app file applies only to HDFS, and some other implementation can have a single file. I am +1 if that is the goal - we just need to find and put in place the appropriate abstractions.
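To make the namespace trade-off in this comment concrete, here is a rough sketch comparing how many files each layout would create for one application. It is an illustration only: the container ids, node names, log root, and both helper functions are hypothetical, and the per-node paths are simplified rather than the exact layout YARN uses.

```python
from pathlib import PurePosixPath

# Hypothetical example data: container id -> (node it ran on, log files it wrote).
CONTAINERS = {
    "container_01": ("node-a", ["stdout", "stderr", "syslog"]),
    "container_02": ("node-a", ["stdout", "stderr", "syslog"]),
    "container_03": ("node-b", ["stdout", "stderr", "syslog"]),
    "container_04": ("node-b", ["stdout", "stderr", "syslog"]),
}


def per_file_layout(log_root, app_id, containers):
    """Layout proposed in the issue: one file per container log file."""
    return [
        PurePosixPath(log_root) / app_id / container_id / log_name
        for container_id, (_node, log_names) in containers.items()
        for log_name in log_names
    ]


def per_node_layout(log_root, app_id, containers):
    """Simplified per-node, per-app layout: one aggregated file per node."""
    nodes = sorted({node for node, _log_names in containers.values()})
    return [PurePosixPath(log_root) / app_id / node for node in nodes]


raw_files = per_file_layout("/logs", "app_0001", CONTAINERS)
aggregated_files = per_node_layout("/logs", "app_0001", CONTAINERS)

print(len(raw_files), "files with one file per container log")
print(len(aggregated_files), "files with one aggregated file per node")
```

With these made-up numbers the per-file layout creates 12 files and the per-node layout creates 2. The gap grows with the number of containers and log types per application, which is the NameNode load concern raised above.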
> Yarn aggregated logs are difficult for external tools to understand
> -------------------------------------------------------------------
>
>                 Key: YARN-1440
>                 URL: https://issues.apache.org/jira/browse/YARN-1440
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: ledion bitincka
>              Labels: log-aggregation, logs, tfile, yarn
>
> The log aggregation feature in Yarn is awesome! However, the file type and
> format in which the log files are aggregated into (TFile) should either be
> much simpler or be made pluggable. The current TFile format forces anyone who
> wants to see the files to either
> a) use the web UI
> b) use the CLI tools (yarn logs) or
> c) write custom code to read the files
> My suggestion would be to simplify the log collection by collecting and
> writing the raw log files into a directory structure as follows:
> {noformat}
> /{log-collection-dir}/{app-id}/{container-id}/{log-file-name}
> {noformat}
> This way the application developers can (re)use a much wider array of tools
> to process the logs.
> For the readers who are not familiar with logs and their format you can find
> more info the following two blog posts:
> http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
> http://blogs.splunk.com/2013/11/18/hadoop-2-0-rant/

--
This message was sent by Atlassian JIRA
(v6.1#6144)