Hi Namit, Hourly_fact is partitioned on dt and hr.
Marc On Oct 3, 2010 10:00 PM, "Namit Jain" <nj...@facebook.com> wrote: > What is your table hourly_fact partitioned on ? > > ________________________________________ > From: Marc Limotte [mslimo...@gmail.com] > Sent: Friday, October 01, 2010 2:10 PM > To: hive-user@hadoop.apache.org > Subject: hive query doesn't seem to limit itself to partitions based on the WHERE clause > > Hi, > > From looking at the hive log output, it seems that my job is accessing many more partitions than it needs to? For example, my query is something like: > > INSERT OVERWRITE TABLE daily_fact > PARTITION (dt='2010-09-29') > SELECT > 20100929 as stamp, > tagtype, > country, > sum(num_requests) AS num_requests > FROM > hourly_fact HF > WHERE > (HF.dt = '2010-09-29' AND HF.hr > '07' ) > OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) > GROUP BY > 20100929, tagtype, country > > Based on the WHERE clause, I would expect it to look only at partitions in the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log contains entries like: > > 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > > And many other hours outside my WHERE constraint. I assume this means that it's processing those directories. The answer still comes out right, but I'm concerned about the performance. > > Would appreciate some help understanding what this means and how to fix it. > > Thanks, > Marc > >