I have several Parquet files on Amazon S3, which have three fields, username STRING, request STRING and timestamp STRING.
If I create a table in Hive and add a partition, everything is fine. CREATE EXTERNAL TABLE IF NOT EXISTS access_logs(username STRING, request STRING, `timestamp` STRING) PARTITIONED BY (year INT, month INT, day INT, hour INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' LOCATION 's3a://logstash-access-structured/'; ALTER TABLE access_logs ADD IF NOT EXISTS PARTITION(year=2016, month=7, day=16, hour=21) LOCATION 's3a://chef-logstash-access-structured/2016/07/16/21'; select *from access_logs where year=2016 and month=7 and day=16 and hour=21 and request!='' limit 100; Experiment 1 Create a table without the request column -> add a column named request -> add partition -> select -> correct. CREATE EXTERNAL TABLE IF NOT EXISTS access_logs1(username STRING, `timestamp` STRING) PARTITIONED BY (year INT, month INT, day INT, hour INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' LOCATION 's3a://logstash-access-structured/'; ALTER TABLE access_logs1 ADD COLUMNS (request STRING); ALTER TABLE access_logs1 ADD IF NOT EXISTS PARTITION(year=2016, month=7, day=16, hour=21) LOCATION 's3a://chef-logstash-access-structured/2016/07/16/21'; select *from access_logs1 where year=2016 and month=7 and day=16 and hour=21 and request!='' limit 100; The request column is mapped to the Parquet files correctly. Experiment 2 Create a table without the request column -> add partition -> add a column named request -> select -> all NULL values. CREATE EXTERNAL TABLE IF NOT EXISTS access_logs2(username STRING, `timestamp` STRING) PARTITIONED BY (year INT, month INT, day INT, hour INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' LOCATION 's3a://logstash-access-structured/'; ALTER TABLE access_logs2 ADD IF NOT EXISTS PARTITION(year=2016, month=7, day=16, hour=21) LOCATION 's3a://chef-logstash-access-structured/2016/07/16/21'; ALTER TABLE access_logs2 ADD COLUMNS (request STRING); select *from access_logs2 where year=2016 and month=7 and day=16 and hour=21 and request!='' limit 100; All values of the request column are NULLs, obviously the request column of the table is not mapped to the request field of Parquet files. Can somebody tell me why? Many thanks in advance.
