[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions
[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HIVE-4486:

    Resolution: Fixed
    Fix Version/s: 0.12.0, 0.11.1
    Assignee: Gopal V
    Status: Resolved (was: Patch Available)

I just committed this to branch-0.11 and trunk. Thanks, Gopal!

FetchOperator slows down SMB map joins by 50% when there are many partitions

    Key: HIVE-4486
    URL: https://issues.apache.org/jira/browse/HIVE-4486
    Project: Hive
    Issue Type: Bug
    Components: Query Processor
    Affects Versions: 0.12.0
    Environment: Ubuntu LXC 12.10
    Reporter: Gopal V
    Assignee: Gopal V
    Priority: Minor
    Fix For: 0.11.1, 0.12.0
    Attachments: HIVE-4486.patch, smb-profile.html

While looking at log files for SMB joins in Hive, it was noticed that the actual join op didn't show up as a significant fraction of the time spent. Most of the time was spent parsing configuration files.

To confirm, I put log lines in the HiveConf constructor and eventually made the following edit to the code:

{code}
--- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
+++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
@@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
    * @return list of file status entries
    */
   private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
-    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
-    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
+    boolean recursive = false;
     if (!recursive) {
       return fs.listStatus(p);
     }
{code}

I then re-ran my query to compare timings.

|| ||Before||After||
|Cumulative CPU|731.07 sec|386.0 sec|
|Total time|347.66 sec|218.855 sec|

The query used was

{code}
INSERT OVERWRITE LOCAL DIRECTORY '/grid/0/smb/'
select inv_item_sk
from inventory inv
join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
limit 10;
{code}

This was run on a scale=2 TPC-DS data set, where both store_sales and inventory are bucketed into 4 buckets, with store_sales split into 7 partitions and inventory into 261 partitions.

78% of all CPU time was spent within new HiveConf(). The YourKit profiler runs are attached.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
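The release note for this issue reads "Avoid creating new HiveConf() within the FetchOperator loop". The general idea can be sketched as follows. This is a minimal illustration only, not the committed patch: CostlyConf, PerCallLister and CachedLister are hypothetical stand-ins and none of them are Hive classes. The sketch counts how often the expensive object is built when the flag is looked up per call versus read once and reused; the figure of 261 calls simply borrows the partition count of the inventory table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a configuration object that is expensive to build. */
final class CostlyConf {
  static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();
  private final boolean recursive;

  CostlyConf(boolean recursive) {
    // A real configuration class might parse resource files here.
    CONSTRUCTIONS.incrementAndGet();
    this.recursive = recursive;
  }

  boolean isRecursive() {
    return recursive;
  }
}

/** Builds a fresh configuration on every call (the pattern the diff removes). */
final class PerCallLister {
  List<String> list(String partition) {
    boolean recursive = new CostlyConf(false).isRecursive();
    return listing(partition, recursive);
  }

  /** Fake directory listing, shared by both variants. */
  static List<String> listing(String partition, boolean recursive) {
    List<String> files = new ArrayList<>();
    files.add(partition + "/bucket_0");
    if (recursive) {
      files.add(partition + "/sub/bucket_0");
    }
    return files;
  }
}

/** Reads the flag once at construction and reuses it for every partition. */
final class CachedLister {
  private final boolean recursive;

  CachedLister(CostlyConf conf) {
    this.recursive = conf.isRecursive();
  }

  List<String> list(String partition) {
    return PerCallLister.listing(partition, recursive);
  }
}

class ConfHoistingDemo {
  public static void main(String[] args) {
    int partitions = 261; // illustrative count only

    PerCallLister perCall = new PerCallLister();
    for (int i = 0; i < partitions; i++) {
      perCall.list("part=" + i);
    }
    int perCallBuilds = CostlyConf.CONSTRUCTIONS.getAndSet(0);

    CachedLister cached = new CachedLister(new CostlyConf(false));
    for (int i = 0; i < partitions; i++) {
      cached.list("part=" + i);
    }
    int cachedBuilds = CostlyConf.CONSTRUCTIONS.getAndSet(0);

    System.out.println("per-call builds: " + perCallBuilds);
    System.out.println("cached builds:   " + cachedBuilds);
  }
}
```

Both variants return the same listing for a given partition; only the number of configuration objects built differs, which is the cost the profile above attributes to new HiveConf().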
[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions
[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-4486:

    Attachment: HIVE-4486.patch

Patch based on Navis' suggestions.
[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions
[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-4486:

    Release Note: Avoid creating new HiveConf() within the FetchOperator loop
    Status: Patch Available (was: Open)
[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions
[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-4486:

    Summary: FetchOperator slows down SMB map joins by 50% when there are many partitions (was: FetchOperator slows down SMB map joins with many files)
[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions
[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-4486:

    Attachment: smb-profile.html

Attached the YourKit profile (HTML).
[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions
[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-4486:

    Description: text unchanged apart from the timing table's header row, which is now || ||Before||After|| (was: ||Before||After||)
[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions
[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-4486:

    Affects Version/s: 0.12.0