Which version of Hive, and which file format, are you using?
It could be the reading of file footers for ORC; in recent versions there’s a way 
to disable that (set hive.exec.orc.split.strategy=BI). Other formats may have a 
similar feature that I’m not immediately familiar with.
It could also be slow metastore calls.
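
If the table does turn out to be ORC (the DDL in the original message does not say, so this is an assumption), trying the setting would look roughly like the sketch below. It also assumes a Hive version recent enough to support the property. With BI, splits are generated without reading the ORC file footers, so the difference should show up in the time before the job launches.

```sql
-- Assumption: table A is stored as ORC and this Hive version supports the property.
SET hive.exec.orc.split.strategy=BI;

-- Same query as in the original message; compare the time taken before the job launches.
SELECT DISTINCT user_id
FROM A
WHERE dt >= '20150101' AND dt <= '20150401';
```

If the launch time does not change, the footer reads are probably not the bottleneck, and the metastore calls are the next thing to look at.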

From: Sreenath <sreenaths1...@gmail.com>
Reply-To: u...@hive.apache.org
Date: Friday, September 18, 2015 at 02:24
To: dev@hive.apache.org, u...@hive.apache.org
Subject: Hive Start Up Time Manifolds Greater than Execution Time

Hi All,

Something interesting came to my notice yesterday while I was using Hive for some 
queries. The time Hive took to launch a MapReduce job was many times higher 
than the time Hadoop took to actually execute it.
These are the details of the table the query is run against.

CREATE EXTERNAL TABLE A
(
    user_id string,
    stage string,
    url string
)
PARTITIONED BY (dt string , id string)

All the data for the table is stored in S3, and each day there are around 2000 
unique ids, i.e. 2000 partitions added daily. We can assume that each 
partition has on average 100 MB of gzip-compressed data.
Now when I run a query like "SELECT DISTINCT user_id FROM A WHERE 
dt >= '20150101' AND dt <= '20150401'", i.e. over a period of 3 months and 
approx. 60000 partitions, it takes Hive approximately 2 hrs to launch the 
MapReduce job, and the launched job finishes in just 20 min. So I was wondering 
if someone can help me understand what Hive is doing in those 2 hrs?
I would really appreciate some help here. Thanks in advance!


Best,
Sreenath
