Hi, I’m using Hive 0.11 on EMR to join data from S3 and write the joined data back to it. The join working fine but I have an issue with the directory structure of the input data. The input directories are structured like:
s3://bucket/input_data/id=$ID/tblname=$TBL/dt=DT/hr=$HR This is a pretty standard use case for partitioning but the issue is that I want to load in the data for all IDs but only some of the subdirectories for an ID, i.e. s3://bucket/input_data/id=1/tblname=table/dt=2014-02-28/hr=12 s3://bucket/input_data/id=2/tblname=table/dt=2014-02-28/hr=12 s3://bucket/input_data/id=3/tblname=table/dt=2014-02-28/hr=12 s3://bucket/input_data/id=4/tblname=table/dt=2014-02-28/hr=12 and so on. I know that I can’t use dynamic partitioning in this case because according to the Hive documentation you can’t do: INSERT OVERWRITE TABLE table PARTITION(id, tblname=“table”, dt='2014-02-28', hr=12) And if I have PARTITIONED BY (token STRING, tbl STRING, dt STRING, hr INT) in my CREATE TABLE statement and do: ALTER TABLE table ADD PARTITION (tblname=“table”, dt=“2014-02-28”, hr=12) I get a SemanticException error that says, "Invalid partition spec specified table is partitioned but partition spec is not specified or does not fully match table partitioning” Is there a way to deal with this problem in Hive itself or will I have to use a script or something else in conjunction with Hive to get around this?