[
https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thejas M Nair updated HIVE-5009:
--------------------------------
Fix Version/s: (was: 0.12.0)
Preparing for 0.12 release. Removing fix version of 0.12 for those that are not
in 0.12 branch.
> Fix minor optimization issues
> -----------------------------
>
> Key: HIVE-5009
> URL: https://issues.apache.org/jira/browse/HIVE-5009
> Project: Hive
> Issue Type: Improvement
> Reporter: Benjamin Jakobus
> Assignee: Benjamin Jakobus
> Priority: Minor
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> I have found some minor optimization issues in the codebase, which I would
> like to rectify and contribute. Specifically, these are:
> The optimizations that could be applied to Hive's code base are as follows:
> 1. Use StringBuffer when appending strings - In 184 instances, the
> concatination operator (+=) was used when appending strings. This is
> inherintly inefficient - instead Java's StringBuffer or StringBuilder class
> should be used. 12 instances of this optimization can be applied to the
> GenMRSkewJoinProcessor class and another three to the optimizer. CliDriver
> uses the + operator inside a loop, so does the column projection utilities
> class (ColumnProjectionUtils) and the aforementioned skew-join processor.
> Tests showed that using the StringBuilder when appending strings is 57\%
> faster than using the + operator (using the StringBuffer took 122
> milliseconds whilst the + operator took 284 milliseconds). The reason as to
> why using the StringBuffer class is preferred over using the + operator, is
> because
> String third = first + second;
> gets compiled to:
> StringBuilder builder = new StringBuilder( first );
> builder.append( second );
> third = builder.toString();
> Therefore, when building complex strings, that, for example involve loops,
> require many instantiations (and as discussed below, creating new objects
> inside loops is inefficient).
> 2. Use arrays instead of List - Java's java.util.Arrays class asList method
> is a more efficient at creating creating lists from arrays than using loops
> to manually iterate over the elements (using asList is computationally very
> cheap, O(1), as it merely creates a wrapper object around the array; looping
> through the list however has a complexity of O(n) since a new list is created
> and every element in the array is added to this new list). As confirmed by
> the experiment detailed in Appendix D, the Java compiler does not
> automatically optimize and replace tight-loop copying with asList: the
> loop-copying of 1,000,000 items took 15 milliseconds whilst using asList is
> instant.
> Four instances of this optimization can be applied to Hive's codebase (two of
> these should be applied to the Map-Join container - MapJoinRowContainer) -
> lines 92 to 98:
> for (obj = other.first(); obj != null; obj = other.next()) {
> ArrayList<Object> ele = new ArrayList(obj.length);
> for (int i = 0; i < obj.length; i++) {
> ele.add(obj[i]);
> }
> list.add((Row) ele);
> }
> 3. Unnecessary wrapper object creation - In 31 cases, wrapper object creation
> could be avoided by simply using the provided static conversion methods. As
> noted in the PMD documentation, "using these avoids the cost of creating
> objects that also need to be garbage-collected later."
> For example, line 587 of the SemanticAnalyzer class, could be replaced by the
> more efficient parseDouble method call:
> // Inefficient:
> Double percent = Double.valueOf(value).doubleValue();
> // To be replaced by:
> Double percent = Double.parseDouble(value);
> Our test case in Appendix D confirms this: converting 10,000 strings into
> integers using Integer.parseInt(gen.nextSessionId()) (i.e. creating an
> unnecessary wrapper object) took 119 on average; using parseInt() took only
> 38. Therefore creating even just one unnecessary wrapper object can make your
> code up to 68% slower.
> 4. Converting literals to strings using + "" - Converting literals to strings
> using + "" is quite inefficient (see Appendix D) and should be done by
> calling the toString() method instead: converting 1,000,000 integers to
> strings using + "" took, on average, 1340 milliseconds whilst using the
> toString() method only required 1183 milliseconds (hence adding empty strings
> takes nearly 12% more time).
> 89 instances of this using + "" when converting literals were found in Hive's
> codebase - one of these are found in the JoinUtil.
> 5. Avoid manual copying of arrays - Instead of copying arrays as is done in
> GroupByOperator on line 1040 (see below), the more efficient System.arraycopy
> can be used (arraycopy is a native method meaning that the entire memory
> block is copied using memcpy or mmove).
> // Line 1040 of the GroupByOperator
> for (int i = 0; i < keys.length; i++) {
> forwardCache[i] = keys[i];
> }
> Using System.arraycopy on an array of 10,000 strings was (close to) instant
> whilst the manual copy took 6 milliseconds.
> 11 instances of this optimization should be applied to the Hive codebase.
> 6. Avoiding instantiation inside loops - As noted in the PMD documentation,
> "new objects created within loops should be checked to see if they can
> created outside them and reused.".
> Declaring variables inside a loop (i from 0 to 10,000) took 300 milliseconds
> whilst declaring them outside took only 88 milliseconds (this can be
> explained by the fact that when declaring a variable outside the loop, its
> reference will be re-used for each iteration. However when declaring
> variables inside a loop, new references will be created for each iteration.
> In our case, 10,000 references will be created by the time that this loop
> finishes, meaning lots of work in terms of memory allocation and garbage
> collection). 1623 instances of this optimization can be applied.
> To summarize, I propose to modify the code to address issue 1 and issue 6
> (remaining issues (2 - 5) will be addressed later). Details are specified as
> sub-tasks.
--
This message was sent by Atlassian JIRA
(v6.1#6144)