[jira] [Updated] (HIVE-5009) Fix minor optimization issues

2013-10-06 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated HIVE-5009:


Fix Version/s: (was: 0.12.0)

Preparing for 0.12 release. Removing fix version of 0.12 for those that are not 
in 0.12 branch.


 Fix minor optimization issues
 -

 Key: HIVE-5009
 URL: https://issues.apache.org/jira/browse/HIVE-5009
 Project: Hive
  Issue Type: Improvement
Reporter: Benjamin Jakobus
Assignee: Benjamin Jakobus
Priority: Minor
   Original Estimate: 48h
  Remaining Estimate: 48h

 I have found some minor optimization issues in the codebase, which I would 
 like to rectify and contribute. The optimizations that could be applied to 
 Hive's code base are as follows:
 1. Use StringBuffer when appending strings - In 184 instances, the 
 concatenation operator (+=) was used when appending strings. This is 
 inherently inefficient - instead Java's StringBuffer or StringBuilder class 
 should be used. 12 instances of this optimization can be applied to the 
 GenMRSkewJoinProcessor class and another three to the optimizer. CliDriver 
 uses the + operator inside a loop, as do the column projection utilities 
 class (ColumnProjectionUtils) and the aforementioned skew-join processor. 
 Tests showed that appending strings with the StringBuilder took 57% less 
 time than using the + operator (using the StringBuffer took 122 
 milliseconds whilst the + operator took 284 milliseconds). The reason why 
 a single explicit buffer is preferred over the + operator is that
 String third = first + second;
 gets compiled to roughly:
 StringBuilder builder = new StringBuilder( first );
 builder.append( second );
 third = builder.toString();
 Therefore, building complex strings with +, for example inside loops, 
 requires many instantiations (and as discussed below, creating new objects 
 inside loops is inefficient).
 2. Use arrays instead of List - The asList method of Java's java.util.Arrays 
 class is more efficient at creating lists from arrays than using loops 
 to manually iterate over the elements (using asList is computationally very 
 cheap, O(1), as it merely creates a wrapper object around the array; looping 
 through the array however has a complexity of O(n) since a new list is created 
 and every element in the array is added to this new list). As confirmed by 
 the experiment detailed in Appendix D, the Java compiler does not 
 automatically optimize and replace tight-loop copying with asList: the 
 loop-copying of 1,000,000 items took 15 milliseconds whilst using asList is 
 instant. 
 Four instances of this optimization can be applied to Hive's codebase (two of 
 these should be applied to the Map-Join container - MapJoinRowContainer) - 
 lines 92 to 98:
 for (obj = other.first(); obj != null; obj = other.next()) {
   ArrayList<Object> ele = new ArrayList<Object>(obj.length);
   for (int i = 0; i < obj.length; i++) {
     ele.add(obj[i]);
   }
   list.add((Row) ele);
 }
 3. Unnecessary wrapper object creation - In 31 cases, wrapper object creation 
 could be avoided by simply using the provided static conversion methods. As 
 noted in the PMD documentation, using these avoids the cost of creating 
 objects that also need to be garbage-collected later.
 For example, line 587 of the SemanticAnalyzer class, could be replaced by the 
 more efficient parseDouble method call:
 // Inefficient:
 Double percent = Double.valueOf(value).doubleValue();
 // To be replaced by:
 Double percent = Double.parseDouble(value);
 Our test case in Appendix D confirms this: converting 10,000 strings (from 
 gen.nextSessionId()) into integers by way of an unnecessary wrapper object 
 took 119 on average; using Integer.parseInt() directly took only 38. 
 Therefore avoiding even just one unnecessary wrapper object cut the 
 conversion time by about 68% in this test.
 4. Converting literals to strings using + "" - Converting literals to strings 
 using + "" is quite inefficient (see Appendix D) and should be done by 
 calling the toString() method instead: converting 1,000,000 integers to 
 strings using + "" took, on average, 1340 milliseconds whilst using the 
 toString() method only required 1183 milliseconds (hence using toString() 
 takes nearly 12% less time). 
 89 instances of using + "" to convert literals were found in Hive's 
 codebase - one of these is found in JoinUtil.
 5. Avoid manual copying of arrays - Instead of copying arrays as is done in 
 GroupByOperator on line 1040 (see below), the more efficient System.arraycopy 
 can be used (arraycopy is a native method, meaning that the entire memory 
 block is copied using memcpy or memmove).
 // Line 1040 of the GroupByOperator
 for (int i = 0; i < keys.length; i++) {
   

[jira] [Updated] (HIVE-5009) Fix minor optimization issues

2013-08-12 Thread Benjamin Jakobus (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Jakobus updated HIVE-5009:
---

Attachment: (was: AbstractBucketJoinProc.java)

 Fix minor optimization issues
 -

 Key: HIVE-5009
 URL: https://issues.apache.org/jira/browse/HIVE-5009
 Project: Hive
  Issue Type: Improvement
Reporter: Benjamin Jakobus
Assignee: Benjamin Jakobus
Priority: Minor
 Fix For: 0.12.0

   Original Estimate: 48h
  Remaining Estimate: 48h


[jira] [Updated] (HIVE-5009) Fix minor optimization issues

2013-08-07 Thread Benjamin Jakobus (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Jakobus updated HIVE-5009:
---

Description: 
I have found some minor optimization issues in the codebase, which I would like 
to rectify and contribute. The optimizations that could be applied to Hive's 
code base are as follows:

1. Use StringBuffer when appending strings - In 184 instances, the 
concatenation operator (+=) was used when appending strings. This is inherently 
inefficient - instead Java's StringBuffer or StringBuilder class should be 
used. 12 instances of this optimization can be applied to the 
GenMRSkewJoinProcessor class and another three to the optimizer. CliDriver uses 
the + operator inside a loop, as do the column projection utilities class 
(ColumnProjectionUtils) and the aforementioned skew-join processor. Tests 
showed that appending strings with the StringBuilder took 57% less time than 
using the + operator (using the StringBuffer took 122 milliseconds whilst the + 
operator took 284 milliseconds). The reason why a single explicit buffer is 
preferred over the + operator is that

String third = first + second;

gets compiled to roughly:

StringBuilder builder = new StringBuilder( first );
builder.append( second );
third = builder.toString();

Therefore, building complex strings with +, for example inside loops, 
requires many instantiations (and as discussed below, creating new objects 
inside loops is inefficient).
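
To make the loop case concrete, here is a minimal sketch (not taken from Hive; the class and method names are hypothetical) that contrasts += inside a loop with a single StringBuilder reused for the whole loop. Both methods produce the same string; only the number of intermediate objects differs.

```java
import java.util.Arrays;
import java.util.List;

public class ConcatSketch {
    // Appends with +=: every iteration builds a temporary buffer and a new String.
    public static String joinWithPlus(List<String> parts) {
        String result = "";
        for (String part : parts) {
            result += part + ",";
        }
        return result;
    }

    // Appends to one StringBuilder that is created once, before the loop.
    public static String joinWithBuilder(List<String> parts) {
        StringBuilder builder = new StringBuilder();
        for (String part : parts) {
            builder.append(part).append(',');
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("a", "b", "c");
        System.out.println(joinWithPlus(cols));    // prints a,b,c,
        System.out.println(joinWithBuilder(cols)); // prints a,b,c,
    }
}
```

StringBuilder is used here rather than StringBuffer because no synchronization is needed in a method-local variable.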


2. Use arrays instead of List - The asList method of Java's java.util.Arrays 
class is more efficient at creating lists from arrays than using loops to 
manually iterate over the elements (using asList is computationally very cheap, 
O(1), as it merely creates a wrapper object around the array; looping through 
the array however has a complexity of O(n) since a new list is created and every 
element in the array is added to this new list). As confirmed by the experiment 
detailed in Appendix D, the Java compiler does not automatically optimize and 
replace tight-loop copying with asList: the loop-copying of 1,000,000 items 
took 15 milliseconds whilst using asList is instant. 

Four instances of this optimization can be applied to Hive's codebase (two of 
these should be applied to the Map-Join container - MapJoinRowContainer) - 
lines 92 to 98:

for (obj = other.first(); obj != null; obj = other.next()) {
  ArrayList<Object> ele = new ArrayList<Object>(obj.length);
  for (int i = 0; i < obj.length; i++) {
    ele.add(obj[i]);
  }
  list.add((Row) ele);
}
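
As a rough illustration of the difference (a standalone sketch with hypothetical names, not the MapJoinRowContainer code itself), the two approaches can be compared as follows. One caveat worth keeping in mind: Arrays.asList returns a fixed-size view backed by the original array, so it is only a drop-in replacement where the list is never resized and sharing the array is acceptable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AsListSketch {
    // Manual copy: O(n), the result is an independent, resizable list.
    public static List<Object> copyWithLoop(Object[] obj) {
        List<Object> ele = new ArrayList<Object>(obj.length);
        for (int i = 0; i < obj.length; i++) {
            ele.add(obj[i]);
        }
        return ele;
    }

    // Wrapper: O(1), the result is a fixed-size view backed by the array.
    public static List<Object> wrapWithAsList(Object[] obj) {
        return Arrays.asList(obj);
    }

    public static void main(String[] args) {
        Object[] row = new Object[] {"x", "y"};
        List<Object> copy = copyWithLoop(row);
        List<Object> view = wrapWithAsList(row);
        row[0] = "z"; // a later write to the array is visible through the view only
        System.out.println(copy.get(0)); // prints x
        System.out.println(view.get(0)); // prints z
    }
}
```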


3. Unnecessary wrapper object creation - In 31 cases, wrapper object creation 
could be avoided by simply using the provided static conversion methods. As 
noted in the PMD documentation, using these avoids the cost of creating 
objects that also need to be garbage-collected later.

For example, line 587 of the SemanticAnalyzer class, could be replaced by the 
more efficient parseDouble method call:

// Inefficient:
Double percent = Double.valueOf(value).doubleValue();
// To be replaced by:
Double percent = Double.parseDouble(value);


Our test case in Appendix D confirms this: converting 10,000 strings (from 
gen.nextSessionId()) into integers by way of an unnecessary wrapper object took 
119 on average; using Integer.parseInt() directly took only 38. Therefore 
avoiding even just one unnecessary wrapper object cut the conversion time by 
about 68% in this test.
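
The two conversion styles can be sketched side by side as below (hypothetical class and method names). Note that the saving only materializes if the result is kept in a primitive double; assigning it to a Double variable boxes the value again.

```java
public class ParseSketch {
    // Goes through a Double wrapper object and then unwraps it again.
    public static double viaWrapper(String value) {
        return Double.valueOf(value).doubleValue();
    }

    // Parses straight to a primitive; no wrapper object is involved.
    public static double viaParse(String value) {
        return Double.parseDouble(value);
    }

    public static void main(String[] args) {
        System.out.println(viaWrapper("12.5")); // prints 12.5
        System.out.println(viaParse("12.5"));   // prints 12.5
    }
}
```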

4. Converting literals to strings using + "" - Converting literals to strings 
using + "" is quite inefficient (see Appendix D) and should be done by calling 
the toString() method instead: converting 1,000,000 integers to strings using 
+ "" took, on average, 1340 milliseconds whilst using the toString() method only 
required 1183 milliseconds (hence using toString() takes nearly 12% less 
time). 

89 instances of using + "" to convert literals were found in Hive's 
codebase - one of these is found in JoinUtil.
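
A small sketch of the two conversions, using hypothetical names and an int as the value being converted. Both return the same string; how large the timing gap is will depend on the compiler and JVM in use.

```java
public class ToStringSketch {
    // Conversion by concatenating with an empty string.
    public static String viaConcat(int n) {
        return "" + n;
    }

    // Direct conversion through the static toString method.
    public static String viaToString(int n) {
        return Integer.toString(n);
    }

    public static void main(String[] args) {
        System.out.println(viaConcat(42));   // prints 42
        System.out.println(viaToString(42)); // prints 42
    }
}
```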

5. Avoid manual copying of arrays - Instead of copying arrays as is done in 
GroupByOperator on line 1040 (see below), the more efficient System.arraycopy 
can be used (arraycopy is a native method, meaning that the entire memory block 
is copied using memcpy or memmove).

// Line 1040 of the GroupByOperator
for (int i = 0; i < keys.length; i++) {
  forwardCache[i] = keys[i];
}

Using System.arraycopy on an array of 10,000 strings was (close to) instant 
whilst the manual copy took 6 milliseconds.
11 instances of this optimization should be applied to the Hive codebase.
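
For reference, the loop above and its System.arraycopy equivalent might look like this in isolation. This is a sketch: the names keys and forwardCache mirror the snippet, but the surrounding class and the Object[] element type are assumptions.

```java
import java.util.Arrays;

public class ArrayCopySketch {
    // Element-by-element copy, as in the quoted loop.
    public static Object[] copyWithLoop(Object[] keys) {
        Object[] forwardCache = new Object[keys.length];
        for (int i = 0; i < keys.length; i++) {
            forwardCache[i] = keys[i];
        }
        return forwardCache;
    }

    // Bulk copy: source, source offset, destination, destination offset, length.
    public static Object[] copyWithArraycopy(Object[] keys) {
        Object[] forwardCache = new Object[keys.length];
        System.arraycopy(keys, 0, forwardCache, 0, keys.length);
        return forwardCache;
    }

    public static void main(String[] args) {
        Object[] keys = new Object[] {"k1", "k2", "k3"};
        System.out.println(Arrays.equals(copyWithLoop(keys), copyWithArraycopy(keys))); // prints true
    }
}
```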

6. Avoiding instantiation inside loops - As noted in the PMD documentation, 
new objects created within loops should be checked to see if they can be 
created outside them and reused.
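
One way to apply this, sketched with hypothetical names and a StringBuilder as the reusable object, is to create the object once before the loop and reset it at the start of each iteration:

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {
    // Allocates a fresh StringBuilder on every iteration.
    public static List<String> labelsAllocatingEachTime(int count) {
        List<String> out = new ArrayList<String>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            sb.append("row-").append(i);
            out.add(sb.toString());
        }
        return out;
    }

    // Creates the StringBuilder once and clears it before each use.
    public static List<String> labelsReusingBuilder(int count) {
        List<String> out = new ArrayList<String>(count);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.setLength(0); // discard the previous iteration's contents
            sb.append("row-").append(i);
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(labelsReusingBuilder(3)); // prints [row-0, row-1, row-2]
    }
}
```

Reuse only makes sense when the object can be safely reset between iterations; objects that are stored or shared after each pass still need their own instance.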

Declaring variables inside a loop (i from 0 to 10,000) took 300 milliseconds
whilst declaring them outside took only 88 milliseconds (this can be 

[jira] [Updated] (HIVE-5009) Fix minor optimization issues

2013-08-07 Thread Benjamin Jakobus (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Jakobus updated HIVE-5009:
---

Attachment: AbstractBucketJoinProc.java

 Fix minor optimization issues
 -

 Key: HIVE-5009
 URL: https://issues.apache.org/jira/browse/HIVE-5009
 Project: Hive
  Issue Type: Improvement
Reporter: Benjamin Jakobus
Assignee: Benjamin Jakobus
Priority: Minor
 Fix For: 0.12.0

 Attachments: AbstractBucketJoinProc.java

   Original Estimate: 48h
  Remaining Estimate: 48h


[jira] [Updated] (HIVE-5009) Fix minor optimization issues

2013-08-06 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-5009:
---

Assignee: Benjamin Jakobus

 Fix minor optimization issues
 -

 Key: HIVE-5009
 URL: https://issues.apache.org/jira/browse/HIVE-5009
 Project: Hive
  Issue Type: Improvement
Reporter: Benjamin Jakobus
Assignee: Benjamin Jakobus
Priority: Minor
 Fix For: 0.12.0

   Original Estimate: 48h
  Remaining Estimate: 48h
