Joe McDonnell created IMPALA-6052:
-------------------------------------

             Summary: Improve test data directory structure
                 Key: IMPALA-6052
                 URL: https://issues.apache.org/jira/browse/IMPALA-6052
             Project: IMPALA
          Issue Type: Improvement
          Components: Infrastructure
    Affects Versions: Impala 2.10.0
            Reporter: Joe McDonnell


Dataload generates the hdfs location using this code:

    hdfs_location = '{0}.{1}{2}'.format(db_name, table_name, db_suffix)
    if data_set in ['hive-benchmark', 'functional']:
      hdfs_location = hdfs_location.split('.')[-1]

where db_suffix is used to describe the compression. Here are some examples:
functional.alltypes is stored in /test-warehouse/alltypes/
functional.alltypesagg is stored in /test-warehouse/alltypesagg/
functional_seq.alltypes is stored in /test-warehouse/alltypes_seq/
functional_seq.alltypesagg is stored in /test-warehouse/alltypesagg_seq/
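
To make the mapping above concrete, here is a minimal sketch that wraps the quoted snippet in a function. The function name current_hdfs_location and its argument list are assumptions made for illustration; only the three lines of logic come from the snippet above.

```python
def current_hdfs_location(db_name, table_name, db_suffix, data_set):
    """Hypothetical wrapper around the location logic quoted in this issue."""
    hdfs_location = '{0}.{1}{2}'.format(db_name, table_name, db_suffix)
    # For these two data sets the database prefix is dropped, so every
    # table becomes its own top-level directory under the warehouse root.
    if data_set in ['hive-benchmark', 'functional']:
        hdfs_location = hdfs_location.split('.')[-1]
    return hdfs_location


# Reproduces the directory names from the examples in this issue.
print(current_hdfs_location('functional', 'alltypes', '', 'functional'))
print(current_hdfs_location('functional', 'alltypes', '_seq', 'functional'))
```

With an empty suffix this prints alltypes, and with the _seq suffix it prints alltypes_seq, matching the examples listed above.
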
Tables from the same database are not grouped into a directory. Instead, almost 
every table in functional gets its own top-level directory. After a normal 
dataload, "hdfs dfs -ls /test-warehouse" lists 998 directories. This makes it 
hard to browse our HDFS directory structure. It also makes it hard to 
import/export a single database and its tables.

The tables for a database should be in a single directory for that database. 
The hdfs location should be of the form 
"${db_name}${db_suffix}.db/${table_name}". functional.alltypes should be in 
'/test-warehouse/functional.db/alltypes'. The top-level directory should end up 
with about 50 items with the default dataload.
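
As a rough sketch of the proposed layout, the location could be computed as follows. The names proposed_hdfs_location, proposed_hdfs_path, and WAREHOUSE_ROOT are hypothetical and are not taken from the existing scripts; only the "${db_name}${db_suffix}.db/${table_name}" form and the /test-warehouse root come from this issue.

```python
import posixpath

# Warehouse root as it appears in the examples in this issue.
WAREHOUSE_ROOT = '/test-warehouse'


def proposed_hdfs_location(db_name, table_name, db_suffix):
    """Hypothetical helper: '<db_name><db_suffix>.db/<table_name>'."""
    return '{0}{1}.db/{2}'.format(db_name, db_suffix, table_name)


def proposed_hdfs_path(db_name, table_name, db_suffix):
    """Hypothetical helper: full path under the warehouse root."""
    return posixpath.join(
        WAREHOUSE_ROOT, proposed_hdfs_location(db_name, table_name, db_suffix))


print(proposed_hdfs_path('functional', 'alltypes', ''))
print(proposed_hdfs_path('functional', 'alltypes', '_seq'))
```

Under this scheme all tables of one database and suffix share a single parent directory, so no special case per data set is needed.
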

This will require changes in generate_schema_statement.py (in 
generate_statements() when generating the hdfs_location). It will also require 
changes to the schema templates such as 
testdata/datasets/functional/functional_schema_template.sql. It is also likely 
to require corresponding test changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)