[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

2013-05-17 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4486:


       Resolution: Fixed
    Fix Version/s: 0.12.0
                   0.11.1
         Assignee: Gopal V
           Status: Resolved  (was: Patch Available)

I just committed this to branch-0.11 and trunk. Thanks, Gopal!

 FetchOperator slows down SMB map joins by 50% when there are many partitions
 

 Key: HIVE-4486
 URL: https://issues.apache.org/jira/browse/HIVE-4486
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
 Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Assignee: Gopal V
Priority: Minor
 Fix For: 0.11.1, 0.12.0

 Attachments: HIVE-4486.patch, smb-profile.html


 While looking at log files for SMB joins in Hive, I noticed that the actual 
 join operator did not account for a significant fraction of the time spent. 
 Most of the time was spent parsing configuration files.
 To confirm, I put log lines in the HiveConf constructor and eventually made 
 the following edit to the code:
 {code}
 --- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
 +++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
 @@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
     * @return list of file status entries
     */
    private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
 -    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
 -    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
 +    boolean recursive = false;
      if (!recursive) {
        return fs.listStatus(p);
      }
 {code}
 I then re-ran my query to compare timings:
 || ||Before||After||
 |Cumulative CPU|731.07 sec|386.0 sec|
 |Total time|347.66 sec|218.855 sec|
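
For reference, the relative savings implied by the table can be worked out directly. The sketch below is only arithmetic on the four numbers above; the class and method names are made up for illustration. It gives a reduction of roughly 47% in cumulative CPU time and roughly 37% in total time.

```java
import java.util.Locale;

public class TimingDelta {
    // Percentage by which `after` is smaller than `before`.
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the Before/After table in the issue description.
        double cpu = percentReduction(731.07, 386.0);
        double wall = percentReduction(347.66, 218.855);
        System.out.println(String.format(Locale.ROOT, "Cumulative CPU reduction: %.1f%%", cpu));
        System.out.println(String.format(Locale.ROOT, "Total time reduction: %.1f%%", wall));
    }
}
```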
 The query used was:
 {code}
 INSERT OVERWRITE LOCAL DIRECTORY '/grid/0/smb/'
 SELECT inv_item_sk
 FROM inventory inv
 JOIN store_sales ss ON (ss.ss_item_sk = inv.inv_item_sk)
 LIMIT 10;
 {code}
 This was run on a scale=2 TPC-DS data set in which both store_sales and 
 inventory are bucketed into 4 buckets, with store_sales split into 7 
 partitions and inventory into 261 partitions.
 78% of all CPU time was spent within new HiveConf(). The YourKit profiler 
 runs are attached.
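
The experiment above hard-codes the flag, which is useful for measuring but drops the setting. A general way to keep the setting while avoiding the cost is to read it once and reuse the value on every call. The following is a minimal, self-contained sketch of that idea and not the committed patch, which this thread does not show. CostlyConf is a hypothetical stand-in for a configuration object that is expensive to construct, and the count of 261 calls is borrowed from the partition count purely for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConfHoistSketch {
    // Hypothetical stand-in for an expensive configuration object.
    // A real one would parse configuration files in its constructor.
    static final class CostlyConf {
        static final AtomicInteger constructions = new AtomicInteger();
        private final boolean recursive;

        CostlyConf(boolean recursive) {
            constructions.incrementAndGet();
            this.recursive = recursive;
        }

        boolean isRecursive() {
            return recursive;
        }
    }

    // Pattern from the original code: a fresh configuration object per call.
    static boolean lookupPerCall() {
        return new CostlyConf(false).isRecursive();
    }

    // Hoisted pattern: the flag is read once and cached in a field.
    static final class Lister {
        private final boolean recursive;

        Lister(CostlyConf conf) {
            this.recursive = conf.isRecursive();
        }

        boolean lookup() {
            return recursive;
        }
    }

    // Number of CostlyConf constructions when the lookup runs `calls` times per call.
    static int countPerCall(int calls) {
        CostlyConf.constructions.set(0);
        for (int i = 0; i < calls; i++) {
            lookupPerCall();
        }
        return CostlyConf.constructions.get();
    }

    // Number of CostlyConf constructions when the flag is hoisted out of the loop.
    static int countHoisted(int calls) {
        CostlyConf.constructions.set(0);
        Lister lister = new Lister(new CostlyConf(false));
        for (int i = 0; i < calls; i++) {
            lister.lookup();
        }
        return CostlyConf.constructions.get();
    }

    public static void main(String[] args) {
        int calls = 261; // illustrative only
        System.out.println("per-call constructions: " + countPerCall(calls));
        System.out.println("hoisted constructions: " + countHoisted(calls));
    }
}
```

With the per-call pattern the number of constructions grows with the number of calls, while the hoisted variant constructs the object once regardless of how many paths are listed.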

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

2013-05-10 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-4486:
--

Attachment: HIVE-4486.patch

Patch based on Navis' suggestions.

 FetchOperator slows down SMB map joins by 50% when there are many partitions
 

 Key: HIVE-4486
 URL: https://issues.apache.org/jira/browse/HIVE-4486
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
 Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor
 Attachments: HIVE-4486.patch, smb-profile.html




[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

2013-05-10 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-4486:
--

Release Note: Avoid creating new HiveConf() within the FetchOperator loop
  Status: Patch Available  (was: Open)

 FetchOperator slows down SMB map joins by 50% when there are many partitions
 

 Key: HIVE-4486
 URL: https://issues.apache.org/jira/browse/HIVE-4486
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
 Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor
 Attachments: HIVE-4486.patch, smb-profile.html




[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

2013-05-03 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-4486:
--

Summary: FetchOperator slows down SMB map joins by 50% when there are many 
partitions  (was: FetchOperator slows down SMB map joins with many files)

 FetchOperator slows down SMB map joins by 50% when there are many partitions
 

 Key: HIVE-4486
 URL: https://issues.apache.org/jira/browse/HIVE-4486
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
 Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor



[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

2013-05-03 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-4486:
--

Attachment: smb-profile.html

Attached the YourKit profile (HTML).

 FetchOperator slows down SMB map joins by 50% when there are many partitions
 

 Key: HIVE-4486
 URL: https://issues.apache.org/jira/browse/HIVE-4486
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
 Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor
 Attachments: smb-profile.html




[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

2013-05-03 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-4486:
--

Description: 
The header row of the timings table gained an empty leading cell, so that

|| ||Before||After||

replaces

||Before||After||

The rest of the description text is unchanged.


 FetchOperator slows down SMB map joins by 50% when there are many partitions
 

 Key: HIVE-4486
 URL: https://issues.apache.org/jira/browse/HIVE-4486
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
 Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor
 Attachments: smb-profile.html



[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

2013-05-03 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-4486:
--

Affects Version/s: 0.12.0

 FetchOperator slows down SMB map joins by 50% when there are many partitions
 

 Key: HIVE-4486
 URL: https://issues.apache.org/jira/browse/HIVE-4486
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
 Environment: Ubuntu LXC 12.10
Reporter: Gopal V
Priority: Minor
 Attachments: smb-profile.html

