Gopal V created HIVE-4486:
-----------------------------

Summary: FetchOperator slows down SMB map joins with many files
Key: HIVE-4486
URL: https://issues.apache.org/jira/browse/HIVE-4486
Project: Hive
Issue Type: Bug
Components: Query Processor
Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor

While looking at log files for SMB joins in Hive, I noticed that the actual join operator accounted for only a small fraction of the time spent. Most of the time went into parsing configuration files.

To confirm this, I put log lines in the HiveConf constructor and eventually made the following edit to the code:

{code}
--- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
+++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
@@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
    * @return list of file status entries
    */
   private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
-    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
-    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
+    boolean recursive = false;
     if (!recursive) {
       return fs.listStatus(p);
     }
{code}

I then re-ran my query to compare timings.

|| ||Before||After||
|Cumulative CPU|731.07 sec|386.0 sec|
|Total time|347.66 sec|218.855 sec|

The query used was:

{code}
INSERT OVERWRITE LOCAL DIRECTORY '/grid/0/smb/'
select inv_item_sk
from inventory inv
join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
limit 100000;
{code}

It was run on a scale=2 TPC-DS data set, where both store_sales and inventory are bucketed into 4 buckets, with store_sales split into 7 partitions and inventory into 261 partitions.

78% of all CPU time was spent within new HiveConf(). The YourKit profiler runs are attached.
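
The edit above hard-codes the flag only to measure the cost; it is not a fix. One possible direction is to resolve the flag once and reuse it on later calls. The following is a minimal, self-contained sketch of that idea and not the actual patch: ExpensiveConf is a hypothetical stand-in for HiveConf, Lister is a hypothetical stand-in for FetchOperator, and the key string is only illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class CachedFlagSketch {

    /** Hypothetical stand-in for a config object that is costly to construct. */
    static final class ExpensiveConf {
        static int constructions = 0; // counts how often the constructor runs
        private final Map<String, String> values;

        ExpensiveConf(Map<String, String> base) {
            constructions++;
            this.values = new HashMap<>(base);
        }

        boolean getBoolean(String key, boolean defaultValue) {
            String v = values.get(key);
            return v == null ? defaultValue : Boolean.parseBoolean(v);
        }
    }

    /** Hypothetical stand-in for the operator that lists files per path. */
    static final class Lister {
        private final Map<String, String> job;
        private Boolean recursive; // resolved on first use, then reused (not thread-safe)

        Lister(Map<String, String> job) {
            this.job = job;
        }

        boolean isRecursive() {
            if (recursive == null) {
                // The expensive object is built once per Lister, not once per call.
                recursive = new ExpensiveConf(job)
                        .getBoolean("mapred.input.dir.recursive", false);
            }
            return recursive;
        }
    }

    public static void main(String[] args) {
        Lister lister = new Lister(new HashMap<>());
        for (int i = 0; i < 1000; i++) {
            lister.isRecursive(); // simulates one listing call per file or partition
        }
        System.out.println("conf objects built: " + ExpensiveConf.constructions);
    }
}
```

With many partitions and buckets the listing method is called once per path, so moving the construction out of the per-call path is what removes the repeated configuration parsing in this sketch.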