Loading and partitioning data when root partition value is variable

Mike Stec Fri, 28 Feb 2014 11:06:31 -0800

Hi,

I’m using Hive 0.11 on EMR to join data from S3 and write the joined data back 
to it. The join working fine but I have an issue with the directory structure 
of the input data. The input directories are structured like:


s3://bucket/input_data/id=$ID/tblname=$TBL/dt=DT/hr=$HR

This is a pretty standard use case for partitioning but the issue is that I 
want to load in the data for all IDs but only some of the subdirectories for an 
ID, i.e.

s3://bucket/input_data/id=1/tblname=table/dt=2014-02-28/hr=12
s3://bucket/input_data/id=2/tblname=table/dt=2014-02-28/hr=12
s3://bucket/input_data/id=3/tblname=table/dt=2014-02-28/hr=12
s3://bucket/input_data/id=4/tblname=table/dt=2014-02-28/hr=12

and so on. 

I know that I can’t use dynamic partitioning in this case because according to 
the Hive documentation you can’t do:

INSERT OVERWRITE TABLE table PARTITION(id, tblname=“table”, dt='2014-02-28', 
hr=12)

And if I have PARTITIONED BY (token STRING, tbl STRING, dt STRING, hr INT) in 
my CREATE TABLE statement and do:

ALTER TABLE table ADD PARTITION (tblname=“table”, dt=“2014-02-28”, hr=12)

I get a SemanticException error that says, "Invalid partition spec specified 
table is partitioned but partition spec is not specified or does not fully 
match table partitioning”

Is there a way to deal with this problem in Hive itself or will I have to use a 
script or something else in conjunction with Hive to get around this?

Loading and partitioning data when root partition value is variable

Reply via email to