[ 
https://issues.apache.org/jira/browse/HDFS-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HDFS-11588:
-------------------------------
    Component/s: tools

> Output Avro format in the offline editlog viewer
> ------------------------------------------------
>
>                 Key: HDFS-11588
>                 URL: https://issues.apache.org/jira/browse/HDFS-11588
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
>
> We found that it is handy to import the edit logs into query engines (e.g., 
> Hive / Presto) to understand how the cluster is used. Some examples include:
> * The size of the data and the number of files written into a directory.
> * The distribution of operations across different directories.
> * The number of files created by a user.
> The answers to these questions give insight into how the clusters are used 
> and are valuable for capacity planning.
> Importing the edit log into query engines makes these questions simpler and 
> more efficient to answer.
> While the Offline Editlog Viewer (OEV) can output editlogs in XML, we found 
> that generating the XML and then transforming it into a format that query 
> engines understand takes a significant amount of time. In our environment it 
> takes minutes to convert a 100MB editlog file into the corresponding Parquet 
> file.
> This jira proposes extending the OEV to output Avro files to make this 
> process efficient. The Avro output is intended for internal tooling: it has 
> certain pre-defined schemas but, like the XML output format, it is not 
> constrained to maintain backward compatibility.
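
For illustration only, here is a minimal Python sketch of the kind of
question listed in the description (distribution of operations, files created
per user), answered directly from OEV XML output. It is not the Avro writer
this issue proposes. The sample document and the element names it relies on
(RECORD, OPCODE, DATA, PERMISSION_STATUS, USERNAME) are assumptions about the
XML layout, and counting each OP_ADD as one file creation is a simplification.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample shaped like OEV XML output. The element names are
# assumptions for this sketch; check them against real OEV output.
SAMPLE_XML = """
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA><TXID>1</TXID></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>2</TXID>
      <PATH>/user/alice/a.txt</PATH>
      <PERMISSION_STATUS><USERNAME>alice</USERNAME></PERMISSION_STATUS>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>3</TXID>
      <PATH>/user/alice/b.txt</PATH>
      <PERMISSION_STATUS><USERNAME>alice</USERNAME></PERMISSION_STATUS>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>4</TXID>
      <PATH>/user/bob/c.txt</PATH>
      <PERMISSION_STATUS><USERNAME>bob</USERNAME></PERMISSION_STATUS>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_MKDIR</OPCODE>
    <DATA>
      <TXID>5</TXID>
      <PATH>/user/bob/dir</PATH>
      <PERMISSION_STATUS><USERNAME>bob</USERNAME></PERMISSION_STATUS>
    </DATA>
  </RECORD>
</EDITS>
"""


def summarize(xml_text):
    """Return (operation counts, OP_ADD counts per user) for one XML dump."""
    root = ET.fromstring(xml_text)
    ops = Counter()
    files_by_user = Counter()
    for record in root.iter("RECORD"):
        opcode = record.findtext("OPCODE")
        ops[opcode] += 1
        # Simplification: treat every OP_ADD record as one file creation.
        if opcode == "OP_ADD":
            user = record.findtext("DATA/PERMISSION_STATUS/USERNAME")
            if user:
                files_by_user[user] += 1
    return ops, files_by_user


if __name__ == "__main__":
    ops, files_by_user = summarize(SAMPLE_XML)
    print(dict(ops))
    print(dict(files_by_user))
```

A query engine would run the same aggregations as SQL over the imported
records; the sketch only shows the shape of the data those queries need.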



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

