[ https://issues.apache.org/jira/browse/IMPALA-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell closed IMPALA-3607.
---------------------------------
    Resolution: Won't Fix

Closing this old issue. Things have changed substantially since this was filed 
and snapshots aren't as important as they were.

> Reduce test data loading time from snapshot
> -------------------------------------------
>
>                 Key: IMPALA-3607
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3607
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>    Affects Versions: Impala 2.5.0
>            Reporter: Dimitris Tsirogiannis
>            Priority: Minor
>              Labels: test-infra
>
> Loading test data from snapshot takes a significant amount of time 
> (~20-30min). Given the amount of data loaded (~4GB), the process of loading 
> test data to a local 3-node mini-hdfs cluster should be significantly faster. 
> The process currently works as follows:
> 1. Download the latest snapshot 
> 2. Unzip 
> 3. Use the hdfs dfs -put command to copy from the local file system to hdfs
> We believe the bulk of the time goes to step #3 and is attributable to namenode 
> overhead. Below are a few ideas we can try to improve this:
> 1. Use a backup-and-restore approach for hdfs metadata/data that doesn't go 
> through the namenode. For example, once data is loaded to an hdfs cluster 
> using the old approach, create two snapshots, one for metadata and one for 
> data. Loading the test data is then just a matter of unzipping the snapshots 
> to the appropriate directories. A similar approach is used to back up and 
> restore hdfs clusters 
> (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hdfs_metadata_backup.html).
> A jenkins job would still be responsible for checking for changes in test 
> data, doing the slow data loading, and creating the new snapshots. 
> 2. Other ideas include the use of EC2 AMIs, docker and/or hdfs checkpointing. 
> 3. Use faster compression/decompression tools.
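
The issue's belief that most of the time goes to step #3 could be checked by timing the three loading steps separately. What follows is only a minimal Python sketch, not the project's actual tooling. It assumes the snapshot is a gzip-compressed tarball fetched with curl, since the issue states neither the archive format nor the download tool. The function names and all arguments are hypothetical.

```python
import subprocess
import time
from typing import Callable, Dict, List


def time_steps(steps: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each step in order and return its wall-clock duration in seconds."""
    durations: Dict[str, float] = {}
    for name, step in steps.items():
        start = time.monotonic()
        step()
        durations[name] = time.monotonic() - start
    return durations


def run(cmd: List[str]) -> Callable[[], None]:
    """Wrap a command line as a zero-argument step that raises on failure."""
    return lambda: subprocess.run(cmd, check=True)


def snapshot_load_steps(url: str, archive: str, local_dir: str,
                        hdfs_dir: str) -> Dict[str, Callable[[], None]]:
    """Build the three steps from the issue description.

    Assumptions (not stated in the issue): curl is the download tool and
    the snapshot is a .tar.gz archive. Only `hdfs dfs -put` is from the issue.
    """
    return {
        "download": run(["curl", "-fsSL", "-o", archive, url]),
        "unzip": run(["tar", "-xzf", archive, "-C", local_dir]),
        "hdfs_put": run(["hdfs", "dfs", "-put", local_dir, hdfs_dir]),
    }
```

On a machine with a running cluster, passing the result of snapshot_load_steps(...) to time_steps(...) would return one duration per step, which shows directly whether the hdfs dfs -put step dominates.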



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
