Are you using EMR? Have you tried setting hive.optimize.s3.query=true, as mentioned in http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html?
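
If it helps, the setting would go at the top of the Hive session or script, roughly like this. This is only a sketch: I'm assuming the property is spelled all lowercase, as Hive properties normally are, and that it only has an effect on EMR's Hive build. The query is the one from your message, unchanged apart from layout.

```sql
-- Assumption: EMR-specific property, set per session before running the query.
SET hive.optimize.s3.query=true;

-- Query from the original message.
SELECT id,
       (max(unix_timestamp(ts, 'MM/dd/yyyy HH:mm')) -
        min(unix_timestamp(ts, 'MM/dd/yyyy HH:mm'))) / (60*60)
FROM logs
GROUP BY id;
```
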
I haven't tried that option myself, so I'm curious whether it helps in your scenario. The same page also mentions another fix that is supposed to help with partitioned tables: optimizing queries with thousands of input files used to take a long time, but it looks like that fix is enabled by default now.

Just in case, also check your JVM reuse option. If it's too low, performance will suffer. I had it set to 3 to avoid running out of memory. Using the default value of 20 really helps when reading lots of small files.

igor
decide.com

On Mon, Jul 23, 2012 at 8:33 PM, <richin.j...@nokia.com> wrote:

> Hi,
>
> Sorry this is an AWS Hive Specific question. I have two External Hive
> tables for my custom logs.
>
> 1. flat directory structure on AWS S3, no partition and files in bz2
>    compressed format (few big files)
>
> 2. With 3 level of partitions on AWS S3 (lot of small uncompressed files)
>
> I noticed that my queries on the table with Partition is taking forever to
> run. The same queries run fine and finish up quickly on table with no
> partition.
>
> Am I missing something, I suspect this has something to do with the way S3
> behaves.
>
> A query example is :
>
> select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) -
> min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
> from logs
> group by id;
>
> Thanks,
> Richin