Kengo Seki created AIRFLOW-2452:
-----------------------------------

             Summary: Document field_dict for HiveCliHook.load_file must be OrderedDict
                 Key: AIRFLOW-2452
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2452
             Project: Apache Airflow
          Issue Type: Improvement
          Components: docs, Documentation, hive_hooks, hooks
            Reporter: Kengo Seki
            Assignee: Kengo Seki

HiveCliHook.load_file has a parameter called field_dict, which defines name-type pairs for the table's columns. It must be an OrderedDict. Otherwise the columns may be created in a different order from the fields in the file, and users can get unexpected results.

Example: given the following input file:

{code}
$ head /tmp/baby_names.csv
1880,John,0.081541,boy
1880,William,0.080511,boy
1880,James,0.050057,boy
1880,Charles,0.045167,boy
1880,George,0.043292,boy
1880,Frank,0.02738,boy
1880,Joseph,0.022229,boy
1880,Thomas,0.021401,boy
1880,Henry,0.020641,boy
{code}

Load the file via HiveCliHook.load_file with field_dict as a normal dict:

{code}
In [1]: from airflow.hooks.hive_hooks import HiveCliHook

In [2]: hook = HiveCliHook()
[2018-05-10 19:49:31,819] {base_hook.py:85} INFO - Using connection to: localhost

In [3]: field_dict = {
   ...:     "year": "INT",
   ...:     "name": "STRING",
   ...:     "pct": "DOUBLE",
   ...:     "sex": "STRING",
   ...: }

In [4]: hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True)
[2018-05-10 19:51:53,854] {hive_hooks.py:424} INFO - DROP TABLE IF EXISTS baby_names;
CREATE TABLE IF NOT EXISTS baby_names (
    sex STRING,
    name STRING,
    pct DOUBLE,
    year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS textfile
;
(snip)
[2018-05-10 19:52:17,965] {hive_hooks.py:232} INFO - Table default.baby_names stats: [numFiles=1, numRows=0, totalSize=1289, rawDataSize=0]
[2018-05-10 19:52:17,966] {hive_hooks.py:232} INFO - OK
[2018-05-10 19:52:17,967] {hive_hooks.py:232} INFO - Time taken: 1.349 seconds
{code}

The file is loaded, but the fields in the CREATE TABLE statement are out of order.

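One way to avoid the reordering is to build field_dict as an OrderedDict from a list of pairs. The following is a minimal sketch, not Airflow's actual implementation: build_column_clause is a hypothetical helper, assumed here to mimic how a column list could be rendered by iterating over field_dict. It runs without a Hive connection.

```python
from collections import OrderedDict


def build_column_clause(field_dict):
    """Hypothetical helper (not part of Airflow): render 'name TYPE'
    pairs in the iteration order of field_dict."""
    return ", ".join(
        "{} {}".format(name, hive_type)
        for name, hive_type in field_dict.items()
    )


# Build from a list of pairs so the order is fixed even on Python
# versions where a plain dict does not preserve insertion order.
field_dict = OrderedDict([
    ("year", "INT"),
    ("name", "STRING"),
    ("pct", "DOUBLE"),
    ("sex", "STRING"),
])

column_clause = build_column_clause(field_dict)
print(column_clause)  # year INT, name STRING, pct DOUBLE, sex STRING
```

Passing this field_dict to hook.load_file in place of the plain dict should produce a CREATE TABLE statement whose columns follow the order of the fields in the file.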
Because the table's column order does not match the file, the loaded data is not read back correctly from Hive:

{code}
hive> SELECT * FROM baby_names LIMIT 10;
OK
1880    John       0.081541    NULL
1880    William    0.080511    NULL
1880    James      0.050057    NULL
1880    Charles    0.045167    NULL
1880    George     0.043292    NULL
1880    Frank      0.02738     NULL
1880    Joseph     0.022229    NULL
1880    Thomas     0.021401    NULL
1880    Henry      0.020641    NULL
1880    Robert     0.020404    NULL
Time taken: 2.465 seconds, Fetched: 10 row(s)
{code}

The year value lands in the sex column, and the year column is NULL because the value "boy" cannot be parsed as INT.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)