Hi Hive masters,
I have a Hive table that is partitioned on two fields, say date and id. For
each date I have entries for around 1400 ids, i.e. I add about 1400
partitions to the table every day. The actual data lives in Amazon S3. The
issue is that if I run a query over the table across all ids for a month,
it takes around 2 hours just to launch the MapReduce job. I analyzed the
logs and found that this time is spent in the function
org.apache.hadoop.hive.ql.exec.ExecDriver.addInputPaths(), which emits log
lines such as "Adding input file s3://raw-data/Analyze/2013/12/01/995". My
guess is that it is doing something like checking whether each directory in
S3 is empty.
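As a rough back-of-the-envelope check of that guess: assuming a 30-day month
and that the 2 hours are spread evenly over all partitions (both are
assumptions on my part, not something the logs state), the delay works out to
a fraction of a second per partition, which seems plausible for a remote
check per directory.

```python
# Rough estimate only; the 30-day month and the even spread are assumptions.
ids_per_day = 1400          # partitions added per date (from the table layout)
days_in_month = 30          # assumed month length
launch_delay_s = 2 * 3600   # observed job-launch delay of about 2 hours

partitions = ids_per_day * days_in_month
seconds_per_partition = launch_delay_s / partitions

print(f"{partitions} partitions, ~{seconds_per_partition:.2f} s per partition")
```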
Is there any way I can edit the function to avoid this lag, maybe by
picking up the directory names from a local disk?
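To make the idea concrete, here is a minimal sketch of what I mean by not
asking S3 for the directory names. It is only an illustration, not Hive code:
the function name partition_paths is hypothetical, and it assumes the
directory layout seen in the log line above (base/year/month/day/id) and that
the list of ids is already known locally.

```python
def partition_paths(base, year, month, days, ids):
    """Build partition directory names locally, without contacting S3.

    Hypothetical helper: assumes the layout base/YYYY/MM/DD/id.
    """
    return [
        f"{base}/{year:04d}/{month:02d}/{day:02d}/{pid}"
        for day in days
        for pid in ids
    ]


# Example: two days and two ids give four partition directories.
paths = partition_paths("s3://raw-data/Analyze", 2013, 12, range(1, 3), [994, 995])
for p in paths:
    print(p)
```

Whether the job-launch code could accept such a precomputed list instead of
probing each directory is exactly what I am unsure about.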
Regards
--
Sreenath S Kamath
Bangalore
Ph No:+91-9590989106