Hi, >From looking at the hive log output, it seems that my job is accessing many more partitions than it needs to? For example, my query is something like:
INSERT OVERWRITE TABLE daily_fact PARTITION (dt='2010-09-29') SELECT 20100929 as stamp, tagtype, country, sum(num_requests) AS num_requests FROM hourly_fact HF WHERE (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) GROUP BY 20100929, tagtype, country Based on the WHERE clause, I would expect it to look only at partitions in the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log contains entries like: 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file > hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > And many other hours outside my WHERE constraint. I assume this means that it's processing those directories. The answer still comes out right, but I'm concerned about the performance. Would appreciate some help understanding what this means and how to fix it. Thanks, Marc