[jira] [Updated] (HDFS-11588) Output Avro format in the offline editlog viewer
[ https://issues.apache.org/jira/browse/HDFS-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HDFS-11588:
---
    Component/s: tools

> Output Avro format in the offline editlog viewer
>
>                 Key: HDFS-11588
>                 URL: https://issues.apache.org/jira/browse/HDFS-11588
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
>
> We found that it is handy to import the edit logs into query engines (e.g.,
> Hive / Presto) to understand how the cluster is used. Some examples include:
> * The size of the data and the number of files that are written into a
> directory.
> * The distribution of the operations across different directories.
> * The number of files that are created by a user.
> The answers to these questions give insight into how the clusters are used
> and are valuable for capacity planning.
> Importing the edit log into query engines makes these questions simpler to
> answer, and lets them be answered efficiently.
> While the Offline Editlog Viewer (OEV) supports outputting editlogs in XML
> format, we found that generating the XML and then transforming it into a
> format that query engines recognize takes a significant amount of time. In
> our environment it takes minutes to convert a 100 MB editlog file into the
> corresponding Parquet file.
> This jira proposes to extend the OEV to output Avro files to make this
> process efficient. As an internal tool, the Avro output format has certain
> pre-defined schemas, but it is not constrained to maintain backward
> compatibility of the output, which is similar to the XML output format.
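For readers who want a feel for the questions listed in the description, here is a minimal Python sketch that answers two of them directly from the XML that the OEV already produces (for example via "hdfs oev -p xml -i <edits file> -o edits.xml"). It is not the Avro output proposed in this issue. The sample document is a simplified, hand-written stand-in for real OEV output (real records carry many more fields), the function name summarize_edits is hypothetical, and treating every OP_ADD record as a file creation is an assumption made for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, hand-written stand-in for OEV XML output (assumption:
# real OP_ADD records contain many more fields than PATH and TXID).
SAMPLE_XML = """<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA><TXID>1</TXID></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA><TXID>2</TXID><PATH>/user/alice/a.txt</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA><TXID>3</TXID><PATH>/user/alice/b.txt</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_MKDIR</OPCODE>
    <DATA><TXID>4</TXID><PATH>/user/bob</PATH></DATA>
  </RECORD>
</EDITS>"""


def summarize_edits(xml_text):
    """Return (operations per opcode, files added per parent directory)."""
    root = ET.fromstring(xml_text)
    ops = Counter()
    files_per_dir = Counter()
    for record in root.iter("RECORD"):
        opcode = record.findtext("OPCODE")
        ops[opcode] += 1
        # Assumption: an OP_ADD record corresponds to a file being created.
        if opcode == "OP_ADD":
            path = record.findtext("DATA/PATH")
            if path:
                files_per_dir[posixpath.dirname(path)] += 1
    return ops, files_per_dir


if __name__ == "__main__":
    ops, files_per_dir = summarize_edits(SAMPLE_XML)
    print(dict(ops))
    print(dict(files_per_dir))
```

A script like this is fine for a single small file, but it still pays the cost of generating and re-parsing XML, which is the overhead the description says a query-engine-friendly output format would avoid.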
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11588) Output Avro format in the offline editlog viewer
[ https://issues.apache.org/jira/browse/HDFS-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HDFS-11588:
---
    Issue Type: New Feature  (was: Bug)

> Output Avro format in the offline editlog viewer
>
>                 Key: HDFS-11588
>                 URL: https://issues.apache.org/jira/browse/HDFS-11588
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai