RE: Performance Issues in Hive with S3 and Partitions

richin.jain Tue, 24 Jul 2012 08:48:31 -0700

Hi Igor,

Thanks for the response. Yes I am using EMR.
I will make changes and let you know if that helps.

Richin

From: ext Igor Tatarinov [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 12:38 AM
To: [email protected]
Subject: Re: Performance Issues in Hive with S3 and Partitions

Are you using EMR?
Have you tried setting
hive.optimize.s3.query=true

as mentioned in
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html

I haven't tried using that option myself. I am curious if it helps in your 
scenario. The above page also mentions another fix that's supposed to help with 
partitioned tables. Optimizing queries with thousands of input files used to 
take a lot of time. But it looks like that fix is enabled by default now.

Just in case, also check your jvm reuse option. If it's too low, performance 
will suffer. I had it set to 3 to avoid running out of memory. Using the 
default value of 20 really helps when reading lots of small files.
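
For reference, both options mentioned above can be applied per session from the Hive prompt. This is only a sketch under a few assumptions: that the EMR Hive build in use supports hive.optimize.s3.query, and that JVM reuse is controlled by the Hadoop 1.x property mapred.job.reuse.jvm.num.tasks (that property name is not given in this thread). The values are the ones discussed above, not tuned recommendations.

```sql
-- EMR-specific Hive option referenced on the AWS page above;
-- assumed to be available in the Hive version running on the cluster.
set hive.optimize.s3.query=true;

-- Assumed property name for the "jvm reuse option": the number of tasks
-- a single JVM may run before it is torn down. 20 is the value described
-- above as the default; lower it (e.g. to 3) if tasks run out of memory.
set mapred.job.reuse.jvm.num.tasks=20;
```

Running "set mapred.job.reuse.jvm.num.tasks;" with no value should print the value currently in effect, which is a quick way to check what the cluster is actually using.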

igor
decide.com
On Mon, Jul 23, 2012 at 8:33 PM, <[email protected]> wrote:
Hi,

Sorry, this is an AWS-specific Hive question. I have two external Hive tables 
for my custom logs:

1. A flat directory structure on AWS S3, with no partitions and files in 
bz2-compressed format (a few big files)

2. Three levels of partitions on AWS S3 (lots of small uncompressed files)

I noticed that my queries on the partitioned table take forever to run. 
The same queries run fine and finish quickly on the table with no partitions.
Am I missing something? I suspect this has something to do with the way S3 
behaves.

An example query is:

select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) - 
min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
from logs
group by id;

Thanks,
Richin
