[ https://issues.apache.org/jira/browse/HDFS-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947287#comment-15947287 ]
Yongjun Zhang commented on HDFS-11588:
--------------------------------------

Hi [~wheat9],

Thanks for proposing this new feature and working on it. We could do the same when creating the audit log. What do you think?

> Output Avro format in the offline editlog viewer
> ------------------------------------------------
>
>                 Key: HDFS-11588
>                 URL: https://issues.apache.org/jira/browse/HDFS-11588
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
>
> We found that it is handy to import the edit logs into query engines (e.g.,
> Hive / Presto) to understand the usage of the cluster. Some examples include:
> * The size of the data and the number of files that are written into a
> directory.
> * The distribution of operations across different directories.
> * The number of files that are created by a user.
> The answers to the above questions give insight into the usage of the
> clusters and are of significant value for capacity planning.
> Importing the edit log into query engines simplifies answering these
> questions and lets them be answered efficiently.
> While the Offline Editlog Viewer (OEV) supports outputting editlogs in XML
> format, we found that both generating the XML and transforming it into a
> format that query engines understand take a significant amount of time.
> In our environment it takes minutes to convert a 100MB editlog file into a
> corresponding Parquet file.
> This jira proposes to extend the OEV to output Avro files to make this
> process efficient. Since this is an internal tool, the Avro output format has
> certain pre-defined schemas but is not constrained to maintain backward
> compatibility of the output, which is similar to the XML output format.
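
For readers who want to see what the analysis described in the issue looks like in practice, here is a minimal sketch in Python of the existing XML route: it flattens OEV XML records into rows and answers two of the example questions (operations per directory, files created per user). It is not the proposed Avro output. The sample document is made up and only mimics the RECORD / OPCODE / DATA layout of OEV XML output, the helper names (`flatten_records`, `top_level_dir`, `summarize`) are hypothetical, and treating every OP_ADD record as one file creation is a simplifying assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample that mimics the RECORD/OPCODE/DATA layout of OEV XML output.
# Real records carry many more fields; only the ones used below are included.
SAMPLE_XML = """
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD><OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA><TXID>1</TXID></DATA></RECORD>
  <RECORD><OPCODE>OP_MKDIR</OPCODE>
    <DATA><TXID>2</TXID><PATH>/user/alice</PATH>
      <PERMISSION_STATUS><USERNAME>alice</USERNAME></PERMISSION_STATUS></DATA></RECORD>
  <RECORD><OPCODE>OP_ADD</OPCODE>
    <DATA><TXID>3</TXID><PATH>/user/alice/a.txt</PATH>
      <PERMISSION_STATUS><USERNAME>alice</USERNAME></PERMISSION_STATUS></DATA></RECORD>
  <RECORD><OPCODE>OP_ADD</OPCODE>
    <DATA><TXID>4</TXID><PATH>/data/b.txt</PATH>
      <PERMISSION_STATUS><USERNAME>bob</USERNAME></PERMISSION_STATUS></DATA></RECORD>
  <RECORD><OPCODE>OP_ADD</OPCODE>
    <DATA><TXID>5</TXID><PATH>/user/alice/c.txt</PATH>
      <PERMISSION_STATUS><USERNAME>alice</USERNAME></PERMISSION_STATUS></DATA></RECORD>
  <RECORD><OPCODE>OP_DELETE</OPCODE>
    <DATA><TXID>6</TXID><PATH>/data/b.txt</PATH></DATA></RECORD>
</EDITS>
"""


def flatten_records(xml_text):
    """Yield one flat dict per RECORD; fields absent from a record are None."""
    root = ET.fromstring(xml_text)
    for record in root.iter("RECORD"):
        data = record.find("DATA")
        if data is None:
            continue  # skip records without a DATA element
        txid = data.findtext("TXID")
        yield {
            "opcode": record.findtext("OPCODE"),
            "txid": int(txid) if txid is not None else None,
            "path": data.findtext("PATH"),
            "user": data.findtext("PERMISSION_STATUS/USERNAME"),
        }


def top_level_dir(path):
    """Map '/user/alice/a.txt' to '/user' so operations can be grouped."""
    parts = path.split("/")
    return "/" + parts[1] if len(parts) > 1 else "/"


def summarize(xml_text):
    """Count operations per opcode, per top-level directory, and creates per user."""
    ops_by_opcode = Counter()
    ops_by_dir = Counter()
    creates_by_user = Counter()
    for row in flatten_records(xml_text):
        ops_by_opcode[row["opcode"]] += 1
        if row["path"]:
            ops_by_dir[top_level_dir(row["path"])] += 1
        # Assumption: each OP_ADD record stands for one file creation.
        if row["opcode"] == "OP_ADD" and row["user"]:
            creates_by_user[row["user"]] += 1
    return ops_by_opcode, ops_by_dir, creates_by_user


if __name__ == "__main__":
    by_opcode, by_dir, by_user = summarize(SAMPLE_XML)
    print("operations per opcode:", dict(by_opcode))
    print("operations per top-level directory:", dict(by_dir))
    print("files created per user:", dict(by_user))
```

On a real cluster the XML input would come from something like `hdfs oev -p xml -i <edits file> -o <output.xml>`. Parsing the whole document in memory as above is fine for a small sample, but it is exactly the kind of XML round trip the issue describes as slow for large editlogs.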