Alex Herbert commented on NUMBERS-142:
--------------------------------------

I have put all the code together for this into a [PR 77|https://github.com/apache/commons-numbers/pull/77] that implements LinearCombination as an interface. The PR puts the code in the JMH module for performance testing. Have a look and see what you think of the design.

Any current use of the LinearCombination class can be replaced by choosing an implementation from LinearCombinations:
{code:java}
// current
double sum1 = LinearCombination.value(1, 2, 3, 4, 5, 6);
// new
double sum2 = LinearCombinations.Exact.INSTANCE.value(1, 2, 3, 4, 5, 6);
{code}
The implementations and references for the performance test are:
||Name||Description||
|Standard|The standard-precision dot product.|
|Current|The current implementation in LinearCombination. This is a variant of Ogita's Dot2s algorithm for 2-fold precision that sums the round-off parts in a different order from Ogita's paper.|
|Dekker|Dekker's double-length precision dot product.|
|Dot2s|Ogita's Dot2s algorithm for 2-fold precision. This method does not require a working array (see discussion below).|
|DotK|Ogita's DotK algorithm for K-fold precision. The k can be specified as a parameter to the instance. Singletons are provided for Dot3 through to Dot7.|
|Extended|Shewchuk's arbitrary-precision algorithms for exact computation using variable-length p-bit numbers where the bits are stored in doubles.|
|Exact|Uses BigDecimal.|

The raw output of the performance test is:
{noformat}
Benchmark                              (c)  (length)    (name)  (size)  Mode  Cnt          Score        Error  Units
LinearCombinationPerformance.twoD     1e20       N/A  standard    1000  avgt    5       6710.841 ±    268.219  ns/op
LinearCombinationPerformance.twoD     1e20       N/A   current    1000  avgt    5      21571.098 ±    589.948  ns/op
LinearCombinationPerformance.twoD     1e20       N/A    dekker    1000  avgt    5      22798.832 ±    535.729  ns/op
LinearCombinationPerformance.twoD     1e20       N/A     dot2s    1000  avgt    5      15739.809 ±    279.495  ns/op
LinearCombinationPerformance.twoD     1e20       N/A      dot2    1000  avgt    5      17551.580 ±    299.583  ns/op
LinearCombinationPerformance.twoD     1e20       N/A      dot3    1000  avgt    5      24441.026 ±    912.896  ns/op
LinearCombinationPerformance.twoD     1e20       N/A      dot4    1000  avgt    5      30620.358 ±    334.428  ns/op
LinearCombinationPerformance.twoD     1e20       N/A      dot5    1000  avgt    5      38060.952 ±    910.109  ns/op
LinearCombinationPerformance.twoD     1e20       N/A      dot6    1000  avgt    5      45877.618 ±   2071.469  ns/op
LinearCombinationPerformance.twoD     1e20       N/A      dot7    1000  avgt    5      53256.087 ±   3022.064  ns/op
LinearCombinationPerformance.twoD     1e20       N/A  extended    1000  avgt    5      24391.977 ±   1150.544  ns/op
LinearCombinationPerformance.twoD     1e20       N/A     exact    1000  avgt    5    4293537.179 ±  29041.774  ns/op
LinearCombinationPerformance.threeD   1e20       N/A  standard    1000  avgt    5       7677.865 ±    274.872  ns/op
LinearCombinationPerformance.threeD   1e20       N/A   current    1000  avgt    5      30349.426 ±   1436.852  ns/op
LinearCombinationPerformance.threeD   1e20       N/A    dekker    1000  avgt    5      25110.456 ±   2464.337  ns/op
LinearCombinationPerformance.threeD   1e20       N/A     dot2s    1000  avgt    5      22854.831 ±   1008.271  ns/op
LinearCombinationPerformance.threeD   1e20       N/A      dot2    1000  avgt    5      23625.147 ±    703.822  ns/op
LinearCombinationPerformance.threeD   1e20       N/A      dot3    1000  avgt    5      26691.231 ±    270.333  ns/op
LinearCombinationPerformance.threeD   1e20       N/A      dot4    1000  avgt    5      36075.471 ±    445.362  ns/op
LinearCombinationPerformance.threeD   1e20       N/A      dot5    1000  avgt    5      55473.600 ±  22107.620  ns/op
LinearCombinationPerformance.threeD   1e20       N/A      dot6    1000  avgt    5      60561.452 ±   2094.282  ns/op
LinearCombinationPerformance.threeD   1e20       N/A      dot7    1000  avgt    5      68363.077 ±   2514.378  ns/op
LinearCombinationPerformance.threeD   1e20       N/A  extended    1000  avgt    5      44935.899 ±    908.291  ns/op
LinearCombinationPerformance.threeD   1e20       N/A     exact    1000  avgt    5    5152541.712 ±  74986.141  ns/op
LinearCombinationPerformance.fourD    1e20       N/A  standard    1000  avgt    5       8397.397 ±     77.708  ns/op
LinearCombinationPerformance.fourD    1e20       N/A   current    1000  avgt    5      39489.947 ±   1018.499  ns/op
LinearCombinationPerformance.fourD    1e20       N/A    dekker    1000  avgt    5      46615.843 ±    889.102  ns/op
LinearCombinationPerformance.fourD    1e20       N/A     dot2s    1000  avgt    5      29625.524 ±    578.690  ns/op
LinearCombinationPerformance.fourD    1e20       N/A      dot2    1000  avgt    5      31495.629 ±   4817.193  ns/op
LinearCombinationPerformance.fourD    1e20       N/A      dot3    1000  avgt    5      46525.082 ±    588.173  ns/op
LinearCombinationPerformance.fourD    1e20       N/A      dot4    1000  avgt    5      55568.213 ±   2678.669  ns/op
LinearCombinationPerformance.fourD    1e20       N/A      dot5    1000  avgt    5      65862.113 ±   4156.989  ns/op
LinearCombinationPerformance.fourD    1e20       N/A      dot6    1000  avgt    5      74365.334 ±   2252.189  ns/op
LinearCombinationPerformance.fourD    1e20       N/A      dot7    1000  avgt    5      83908.295 ±   2334.416  ns/op
LinearCombinationPerformance.fourD    1e20       N/A  extended    1000  avgt    5      66471.331 ±   1468.997  ns/op
LinearCombinationPerformance.fourD    1e20       N/A     exact    1000  avgt    5    6242338.287 ± 303731.813  ns/op
LinearCombinationPerformance.nD       1e20        32  standard    1000  avgt    5      40881.169 ±   1028.620  ns/op
LinearCombinationPerformance.nD       1e20        32   current    1000  avgt    5     488370.139 ±   6822.698  ns/op
LinearCombinationPerformance.nD       1e20        32    dekker    1000  avgt    5     376277.106 ±  41055.472  ns/op
LinearCombinationPerformance.nD       1e20        32     dot2s    1000  avgt    5     274352.368 ±   4212.051  ns/op
LinearCombinationPerformance.nD       1e20        32      dot2    1000  avgt    5     362834.497 ±   4109.045  ns/op
LinearCombinationPerformance.nD       1e20        32      dot3    1000  avgt    5     480087.086 ±   8505.884  ns/op
LinearCombinationPerformance.nD       1e20        32      dot4    1000  avgt    5     596274.595 ±   8312.549  ns/op
LinearCombinationPerformance.nD       1e20        32      dot5    1000  avgt    5     716167.473 ±  10982.713  ns/op
LinearCombinationPerformance.nD       1e20        32      dot6    1000  avgt    5     838779.146 ±  36630.756  ns/op
LinearCombinationPerformance.nD       1e20        32      dot7    1000  avgt    5     953464.494 ±  38360.647  ns/op
LinearCombinationPerformance.nD       1e20        32  extended    1000  avgt    5    1442858.294 ±  27216.570  ns/op
LinearCombinationPerformance.nD       1e20        32     exact    1000  avgt    5   24739454.322 ± 439917.888  ns/op
{noformat}
So on this machine the new Dot2s implementation is faster than the current implementation. The current implementation loops twice over the length of the input vectors; the new implementation uses only one loop. The number of double multiply and add operations should be the same, so I assume the second loop is consuming the extra time.

Dekker's method is slower than Dot2s (optimised to avoid array allocation) and even Dot2 (which allocates an array), yet it delivers only 2-fold precision. The Dekker method requires a double comparison to order the parts of a summation for each term of the dot product. It uses fewer floating-point operations than Dot2, but the branch for the comparison is slow.

By my working, the Extended method will have the same number of floating-point computations as DotK when k = n + 1, where n is the length of the vectors in the dot product. We see this here: the times for Dot3 (twoD), Dot4 (threeD) and Dot5 (fourD) are approximately the same as for the Extended method. The comments on the class note that for large k and short dot products it would be better to use the Extended method instead; however, this has not actually been enforced in code by preventing the use of a large k for short dot products.
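Returning to the Dekker result above, the branch cost sits in the error-free summation step. As a minimal sketch (my illustration, not the PR code) of the two transforms, here is Knuth's branch-free TwoSum round-off against Dekker's FastTwoSum round-off, which is only valid when the operands are ordered by magnitude, so callers must compare (and possibly swap) for every term:
{code:java}
/** Sketch contrasting the two error-free summation transforms (not the PR code). */
public final class TwoSumSketch {
    /** Knuth's TwoSum: round-off of {@code a + b}, branch-free, any operand order. */
    static double twoSumLow(double a, double b, double sum) {
        final double bVirtual = sum - a;
        return (a - (sum - bVirtual)) + (b - bVirtual);
    }

    /**
     * Dekker's FastTwoSum: round-off of {@code a + b}, valid only when
     * {@code |a| >= |b|}. The caller pays a comparison per term to order
     * the operands, and that branch is what shows up in the benchmark.
     */
    static double fastTwoSumLow(double a, double b, double sum) {
        return b - (sum - a);
    }

    public static void main(String[] args) {
        final double a = 1e16;
        final double b = 3.14159;
        final double sum = a + b;
        // Both recover the same round-off lost by the standard-precision sum.
        System.out.println(twoSumLow(a, b, sum));
        System.out.println(fastTwoSumLow(a, b, sum));
    }
}
{code}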
h2. An enhancement

One performance improvement not currently in the PR code is the use of a temporary array for computations. Currently all the methods are effectively static and thread-safe: they allocate working arrays on each method call, which adds a lot of garbage collection overhead. When I tried a variation of Shewchuk's extended-precision method as described in his paper, it was slower until I allowed reuse of the working arrays. That method used 3 working arrays rather than 1, so its garbage collection overhead is high. Trying array reuse on the other implementations also adds a performance benefit. I scrapped the Shewchuk variation, but it does open the question of whether to support only thread-safe computation or to allow faster thread-unsafe instances.

The methods DotK and Extended require a working array of length 2N for a dot product on input arrays of length N. If this working array can be reused then the performance gain is about 10-20%. To avoid cluttering the functional interface for the dot products, this could be done by allowing instances of the implementations to be created that are not thread-safe because they reuse working space. This would benefit the following scenario:
{code:java}
int k = 4;
// DotK will hold an internal working array, resized or reused as appropriate for each computation
LinearCombination.ND lc = new LinearCombinations.DotK(k);
for (int i = 0; i < BIG; i++) {
    double[] a = ...;
    double[] b = ...;
    double sum = lc.value(a, b);
}
{code}
This could of course be extended by allowing the constructor to choose whether to be thread-safe or not. All of the standard instance singletons could be thread-safe, but a user could construct an instance for each working thread that is performing these computations:
{code:java}
// thread-safe instance
LinearCombination.ND lc1 = LinearCombinations.DotK.DOT_4;
// thread-unsafe instance with local working array
int k = 4;
LinearCombination.ND lc2 = new LinearCombinations.DotK(k);
{code}
The use of a thread-local instance for faster computations may be of benefit in Commons Geometry for methods that use an extended-precision computation with many vectors.
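For example, a per-thread instance could be wired up as in the sketch below. The ThreadLocal holder is my illustration and not part of the PR; it assumes the LinearCombination.ND interface and the DotK constructor shown above:
{code:java}
/** Hypothetical usage sketch: one thread-unsafe DotK instance per worker thread. */
public final class ThreadLocalDotExample {
    /** Each thread gets its own DotK(4), so each reuses its own working array. */
    private static final ThreadLocal<LinearCombination.ND> LC =
        ThreadLocal.withInitial(() -> new LinearCombinations.DotK(4));

    /** Dot product using the calling thread's private instance. */
    static double dot(double[] a, double[] b) {
        return LC.get().value(a, b);
    }
}
{code}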
> Improve LinearCombination accuracy during summation of the round-off errors
> ---------------------------------------------------------------------------
>
>                 Key: NUMBERS-142
>                 URL: https://issues.apache.org/jira/browse/NUMBERS-142
>             Project: Commons Numbers
>          Issue Type: Improvement
>          Components: arrays
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Minor
>         Attachments: array_performance.jpg, cond_no.jpg, error_vs_condition_no.jpg, inline_perfomance.jpg
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The algorithm in LinearCombination is an implementation of dot2 from [Ogita et al|http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.1547] (algorithm 5.3). There is a subtle difference in that the original dot2 algorithm sums the round-off from the products and the round-off from the product summation together. The method in LinearCombination sums them separately (using an extra variable) and adds them at the end. This actually improves the accuracy under conditions where the sum of the round-off is greater than the sum of the products, as described below.
> The dot2 algorithm suffers when the total sum is close to 0 but the intermediate sum is far enough from zero that there is a large difference between the exponents of the summed terms and the final result. In this case the sum of the round-off is more important than the sum of the products, which due to massive cancellation is zero. The case is important for Complex numbers, which require a computation of log1p(x^2+y^2-1) when x^2+y^2 is close to 1, such that log(1) would be ~zero yet the true logarithm is representable to very high precision.
> This can be protected against by using the dotK algorithm of Ogita et al with K>2. This saves all the round-off parts from each product and the running sum. These are subject to an error-free transform that repeatedly adds adjacent pairs to generate a new split pair with a closer upper and lower part. Over time this will order the parts from low to high, and these can be summed low-first for an error-free dot product.
> Using this algorithm with a single pass (K=3 for dot3) removes the cancellation error observed in the mentioned use case. Adding a single pass over the parts changes the algorithm from 25n floating-point operations (flops) to 31n flops for the sum of n products.
> A second change for the algorithm is to switch to using [Dekker's|https://doi.org/10.1007/BF01397083] algorithm (Dekker, 1971) to split the number. This extracts two 26-bit mantissas from a 53-bit mantissa (in IEEE 754 the leading bit in front of the 52-bit mantissa is assumed to be 1). This is done by multiplication by 2^s+1 with s = ceil(53/2) = 27:
> {noformat}
> big  = (2^s+1) * a
> a_hi = big - (big - a)
> {noformat}
> The extra bit of precision is carried into the sign of the low part of the split number.
> This is in contrast to the method in LinearCombination, which uses a simple mask on the long representation to obtain the a_hi part in 26 bits; the lower part will then be 27 bits.
> The advantage of Dekker's method is that it produces 2 parts with 26 bits in the mantissa that can be multiplied exactly. The disadvantage is the potential for overflow, requiring a branch condition to check for extreme values.
> It also appropriately handles very small sub-normal numbers, which the current method would mask to create a 0 high part with all the non-zero bits left in the low part. This has knock-on effects on the split multiplication, which requires the high part to be the larger.
> A simple change to the current implementation to use Dekker's split improves the precision on a wide range of test data (not shown).
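As a footnote to the quoted description, Dekker's split is short enough to show in full. This is a minimal sketch of the standard algorithm (my illustration, not the LinearCombination code), omitting the overflow guard for extreme values mentioned above:
{code:java}
/** Sketch of Dekker's split (not the LinearCombination code); no overflow guard. */
public final class DekkerSplitSketch {
    /** The multiplier 2^s + 1 with s = ceil(53 / 2) = 27. */
    private static final double MULTIPLIER = 134217729.0; // 2^27 + 1

    /** Returns the high part of a, holding at most 26 bits of significand. */
    static double highPart(double a) {
        final double big = MULTIPLIER * a;
        return big - (big - a);
    }

    public static void main(String[] args) {
        final double a = Math.PI;
        final double hi = highPart(a);
        final double lo = a - hi;         // exact; the sign of lo carries the extra bit
        System.out.println(a == hi + lo); // true: the split is an exact decomposition
        // hi and lo each fit in 26 bits, so products of the parts are exact doubles.
    }
}
{code}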