Hello,

We are using Hive 0.12. We have a large directory in S3 where we dump daily logs from various application servers:

s3://logs/20141005/machine1/log.gz
s3://logs/20141005/machine2/log.gz
s3://logs/20141006/machine1/log.gz
s3://logs/20141006/machine2/log.gz
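For reference, the layout above corresponds to an external table partitioned on date and machine, roughly like the following sketch (table and column names are illustrative; our actual log schema is omitted):

```sql
-- Illustrative only: the real column list differs.
CREATE EXTERNAL TABLE applogs (
  line STRING              -- placeholder for the raw log record
)
PARTITIONED BY (dt STRING, machine STRING)
LOCATION 's3://logs/';
```

Because the S3 directories are named 20141006/machine1 rather than Hive-style dt=20141006/machine=machine1, Hive cannot infer the partitions from the paths, which is why we add each one manually with an explicit LOCATION.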
In Hive, we have mapped it to a table with date and machine name as partition columns. We add new partitions for all machines every day manually:

ALTER TABLE applogs ADD PARTITION (machine = 8, dt = '20141006') LOCATION xyz;

The problem is that we now have such partitions going back several years for hundreds of machines, which is clogging the Hive metastore. Is there any way to simplify this?

Note that our current approach has these advantages:
1. Adding a partition does not launch any MapReduce job; it just adds a mapping in the metastore.
2. Querying a specific date range and a specific machine range is easy, and queries on a single date and a single machine are extremely fast.

We do not want to lose these advantages. Also, changing the directory hierarchy would be costly for us.

We already looked into dynamic partitioning <https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions>, but it launches a MapReduce job to create the data hierarchies. In our case, the hierarchy already exists.

-- 
Regards,
Saumitra Shahapure