Gopal V created HIVE-4486:
-----------------------------
Summary: FetchOperator slows down SMB map joins with many files
Key: HIVE-4486
URL: https://issues.apache.org/jira/browse/HIVE-4486
Project: Hive
Issue Type: Bug
Components: Query Processor
Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor
While looking at log files for SMB joins in Hive, I noticed that the actual
join operator did not account for a significant fraction of the time spent.
Most of the time was spent parsing configuration files.
To confirm, I put log lines in the HiveConf constructor and eventually made the
following edit to the code:
{code}
--- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
+++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
@@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
    * @return list of file status entries
    */
   private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
-    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
-    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
+    boolean recursive = false;
     if (!recursive) {
       return fs.listStatus(p);
     }
{code}
I then re-ran my query to compare timings:
||Metric||Before||After||
|Cumulative CPU|731.07 sec|386.0 sec|
|Total time|347.66 sec|218.855 sec|
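To put the two timings in relative terms, here is a small sketch of the arithmetic. The class and method names are made up for illustration; only the four numbers come from the table above. It works out to a reduction of roughly 47% in cumulative CPU and roughly 37% in total time.

```java
public class TimingDelta {
    /** Percentage by which 'after' is lower than 'before'. */
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the Before/After table in this report.
        System.out.printf("Cumulative CPU reduction: %.1f%%%n",
                reductionPercent(731.07, 386.0));
        System.out.printf("Total time reduction:     %.1f%%%n",
                reductionPercent(347.66, 218.855));
    }
}
```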
The query used was:
{code}
INSERT OVERWRITE LOCAL DIRECTORY '/grid/0/smb/'
select inv_item_sk
from inventory inv
join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
limit 100000;
{code}
The query was run on a scale=2 TPC-DS data set in which both store_sales and
inventory are bucketed into 4 buckets, with store_sales split into 7 partitions
and inventory into 261 partitions.
78% of all CPU time was spent within new HiveConf(). The YourKit profiler runs
are attached.
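The edit above hard-codes the flag only to measure the cost. One possible direction for a real fix is to read the setting once and reuse it across calls. This is a minimal sketch of that idea, not the patch for this issue. It does not use Hadoop or Hive classes: ExpensiveConf is a hypothetical stand-in for HiveConf, PerCallLister and CachedLister are made-up names, and the path count of 261 x 4 assumes one file per bucket per inventory partition.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CachedRecursiveFlag {
    /** Hypothetical stand-in for a config object that is costly to construct. */
    static final class ExpensiveConf {
        static final AtomicInteger constructions = new AtomicInteger();
        private final boolean recursive;

        ExpensiveConf(boolean recursive) {
            constructions.incrementAndGet(); // stands in for parsing config files
            this.recursive = recursive;
        }

        boolean isRecursive() {
            return recursive;
        }
    }

    /** Builds a new config object on every call, like the code removed in the diff. */
    static final class PerCallLister {
        boolean recursiveFor(String path) {
            return new ExpensiveConf(false).isRecursive();
        }
    }

    /** Reads the flag once at construction time and reuses it for every path. */
    static final class CachedLister {
        private final boolean recursive;

        CachedLister(ExpensiveConf conf) {
            this.recursive = conf.isRecursive();
        }

        boolean recursiveFor(String path) {
            return recursive;
        }
    }

    /** Returns how many ExpensiveConf objects were built while checking n paths. */
    static int constructionsFor(boolean cached, int n) {
        ExpensiveConf.constructions.set(0);
        if (cached) {
            CachedLister lister = new CachedLister(new ExpensiveConf(false));
            for (int i = 0; i < n; i++) {
                lister.recursiveFor("path-" + i);
            }
        } else {
            PerCallLister lister = new PerCallLister();
            for (int i = 0; i < n; i++) {
                lister.recursiveFor("path-" + i);
            }
        }
        return ExpensiveConf.constructions.get();
    }

    public static void main(String[] args) {
        int paths = 261 * 4; // assumed: inventory partitions x buckets
        System.out.println("per-call constructions: " + constructionsFor(false, paths));
        System.out.println("cached constructions:   " + constructionsFor(true, paths));
    }
}
```

With per-call construction the number of config objects grows with the number of paths listed, while the cached variant builds exactly one regardless of how many partitions or buckets the tables have.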