Hello,

We are using Hive 0.12. We have a large directory in S3 where we dump daily
logs from various application servers:
s3://logs/20141005/machine1/log.gz
s3://logs/20141005/machine2/log.gz
s3://logs/20141006/machine1/log.gz
s3://logs/20141006/machine2/log.gz
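
For reference, a hierarchy like the one above would typically be mapped with an external table sketched roughly as below (column names and the log format are assumptions for illustration, not our actual DDL):

```sql
-- Sketch only: external table over the existing S3 hierarchy,
-- partitioned by date and machine name.
CREATE EXTERNAL TABLE applogs (line STRING)
PARTITIONED BY (dt STRING, machine STRING)
LOCATION 's3://logs/';
```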

In Hive, we have mapped it to a table with date and machine name as partition
columns. We manually add new partitions for all machines every day:
ALTER TABLE applogs ADD PARTITION (machine = 8, dt = '20141006') LOCATION
xyz;
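
The daily step could be sketched as a small script that emits one ADD PARTITION statement per machine, to be piped into the Hive CLI (the machine list, quoting of the machine value, and S3 paths here are illustrative assumptions, not our real tooling):

```python
# Sketch only: generate the day's ADD PARTITION statements for a list of
# machines, following the directory layout shown above.
def add_partition_statements(dt, machines):
    return [
        "ALTER TABLE applogs ADD PARTITION (machine = '{m}', dt = '{dt}') "
        "LOCATION 's3://logs/{dt}/{m}/';".format(m=m, dt=dt)
        for m in machines
    ]

for stmt in add_partition_statements('20141006', ['machine1', 'machine2']):
    print(stmt)
```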

Now the problem is that we have such partitions for several years across
hundreds of machines, which is clogging the Hive metadata.

Any way to simplify this?

Note that our current approach has these advantages:
1. Adding a partition does not require launching any MapReduce job; it
just adds a mapping in the metadata.
2. Querying on a specific date range and machine range is easy, and queries
on a single date and a single machine are extremely fast.

We do not want to lose these advantages. Also, changing the directory
hierarchy would be costly for us.

We already looked into dynamic partitioning
<https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions>, but it
launches a MapReduce job to create the data hierarchy. In our case, the
hierarchy already exists.
--
Regards,
Saumitra Shahapure
