[ https://issues.apache.org/jira/browse/NUMBERS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074927#comment-17074927 ]
Alex Herbert commented on NUMBERS-142:
--------------------------------------

I have investigated two outstanding items in the LinearCombinations class:
# Impact of different summations for the extended method
# Use of a cached array

Data has been collected as the median of 5 JMH runs using JDK 11. Reported values are run times (lower is better).

Note: The data for the dot products used a condition number of 1e20. This is valid for lengths of 6 or above. Shorter lengths use the same data but truncated. The data is random but the dot products are unlikely to be ill-conditioned. When the data is used to test the branchless inlined versions this does not matter. The lack of ill-conditioned data may make the array-based extended algorithm appear faster, as it performs zero elimination and so reduces the length of the expanded floating-point representation of the sum. This does, however, represent the common real-world use case where the dot product is not ill-conditioned.

h2. Impact of different summations for the extended method

I tested:
||Method||Description||
|dot2s|Ogita's dot2s algorithm (lower baseline)|
|extended|The extended method of Shewchuk. Summation sums the values in the expansion. The sum is accurate to 1 ULP.|
|extended2|The extended method of Shewchuk. Summation uses a running two-sum algorithm over the expansion for 2-fold precision (see the sketch below). The initial sum is accurate to 1 ULP; this is then added to the sum of the round-off, which is also accurate to 1 ULP. In every case I generated this is exact. There may be some extremely rare cases where this fails to be exact.|
|extended_exact|The extended method of Shewchuk. Summation uses a sticky-sum algorithm over the expansion. This carries the round-off lost at each sum up to the next number as a single sticky bit. This is done for all but the final sum (to avoid including a single-bit, or 1 ULP, error in the result). The result is exact. The algorithm requires certain conditions to be met in the summation which take time to ensure.|
|exact|Computation using BigDecimal (upper baseline)|

There is data for the inline methods of length 2, 3, and 4, and then the array method (marked with an 'A' suffix).

||Length||dot2s||extended||extended2||extended_exact||exact||dot2s A||extended A||extended2 A||extended_exact A||exact A||extended2 A / dot2s A||
|2|306940.4537|361433.2|389010.3|580662.6|44250788.48|344090.5839|363567.5|420009.3|648060.8|46977505.64|1.22|
|3|364950.9996|578679.6|667644.4|859407.3|54986485.63|410494.1042|548362.0|636538.5|988465.9|57459555.72|1.55|
|4|471956.165|853919.3|976081.7|1189582.2|66924264.38|489863.8208|1236478.0|1228542.2|1421218.7|62386967.71|2.51|
|8| | | | | |704726.5403|3225953.8|3556960.0|3174642.9|87008595.67|5.05|
|16| | | | | |935433.6645|6776523.9|6892102.2|6873813.5|143016906.5|7.37|
|32| | | | | |1727660.655|13942893.3|14003726.8|14099622.1|255850004.5|8.11|
|64| | | | | |3486007.003|29031093.3|29523472.5|29618878.5|485887352.7|8.47|

!linear_inline_vs_array.jpg!
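The running two-sum used by extended2 can be implemented branch free. The following is a minimal sketch of the idea, assuming Knuth's two-sum; the method and variable names are illustrative and not those in the PR:

{code:java}
/**
 * Sum the terms of a floating-point expansion using a running two-sum.
 * The round-off of each addition is gathered into a second accumulator
 * that is added to the running sum at the end, giving an effectively
 * 2-fold precision summation. Illustrative sketch only.
 */
static double sumExpansion(double[] e, int length) {
    double sum = 0;
    double roundoff = 0;
    for (int i = 0; i < length; i++) {
        // Knuth's two-sum: s = sum + e[i] with exact round-off t.
        // Branch free: no ordering of |sum| and |e[i]| is required.
        final double s = sum + e[i];
        final double bVirtual = s - sum;
        final double t = (sum - (s - bVirtual)) + (e[i] - bVirtual);
        sum = s;
        roundoff += t;
    }
    // The 1 ULP accurate sum plus the sum of the round-off.
    return sum + roundoff;
}
{code}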
h3. Observations

The inline methods do not actually provide much performance improvement over using an array for the input data. The only noticeable difference is for length 4 with the extended method, where the array version is about 50% slower. The lack of difference is due to the implementation of the extended method, which uses inlined two-products that are then summed. So the array and inline methods for length 2 are identical, length 3 just adds a single extra product to the sum, and length 4 computes 2 inlined two-products and sums them.

The array method uses zero elimination and so has a lot of conditional if statements. The inlined method does not use zero elimination: zeros will have no effect on the sum. Avoiding zeros is useful to reduce the length of the growing expansion, but this is only relevant when the number of products is high. For short lengths, including the zeros is faster than testing for and excluding them.

The extended_exact method is slower than the extended2 method by a noticeable amount when the length is short (2, 3, 4). At larger lengths (>= 8) the final summation is a small part of the run-time and there is no difference. However, given that I cannot create an example where the extended2 method does not compute the exact result, the use of the sticky summation seems to be overkill. The simple method using a running two-sum is easy to implement and branch free. This makes it fast enough on small lengths that the extra overhead versus a simple sum is less than 10%. The result is almost ensured to be exact; it will be within 1 ULP.

Using the extended2 method versus the fast dot2s will have a performance impact dependent on the condition number and the length of the dot product. For short lengths (2 to 4) the method is about 1.2 to 2.5x slower. Long dot products are much slower (8x). The runtime is insignificant compared to using BigDecimal, which is over an order of magnitude slower again.

h2. Use of a cached array

Starting from the dot3 method the summation requires allocating an array to hold intermediate values. I created a variant method that accepts an {{IntFunction<double[]>}} in the constructor. This is used to create the double[]. The implementation used in the benchmark holds a reference to a cached array. It creates a new array if the required size is bigger than the cached one, otherwise it returns the previous array (see the sketch below). I tested the speed on the fastest and slowest methods that require an array:

||Length||dot3||dot3 cached||extended||extended cached||
|2|357740.793|454667.745|363567.489|351322.843|
|3|452966.951|533987.65|548361.956|546975.698|
|4|561549.896|608230.381|1236478|1215191.01|
|8|898350.916|1252854.4|3225953.81|3123528.03|
|16|1869122.7|1671168.34|6776523.91|6784977.07|
|32|3666645.44|3452745.79|13942893.3|14103241.6|
|64|7469366.79|7014372.78|29031093.3|29780801.2|

!linear_cached.jpg!
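A minimal sketch of the caching allocator used for the benchmark variant (assuming this shape; names are illustrative and not the exact code in the PR):

{code:java}
import java.util.function.IntFunction;

/**
 * An IntFunction<double[]> that recycles a working array between calls.
 * A new array is allocated only when the requested size exceeds the
 * cached array length. Not thread-safe: use one instance per thread.
 * Illustrative sketch only.
 */
class CachedArrayFactory implements IntFunction<double[]> {
    private double[] cache = new double[0];

    @Override
    public double[] apply(int size) {
        double[] a = cache;
        if (a.length < size) {
            // Grow to the requested size and retain for the next call.
            a = new double[size];
            cache = a;
        }
        return a;
    }
}
{code}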
h3. Observations

There is no difference for the slow method (extended). For the fast method, using the cache for small array sizes is slower. The JVM can allocate small objects from a thread-local region without using global memory, which would require some level of synchronisation; officially this is known as the Thread Local Allocation Buffer (TLAB). When the length is large enough that the TLAB cannot be used, the method is so slow that the allocation of the buffer is insignificant. So a variant to recycle a buffer is not required.

h2. Discussion

Do we change the current LinearCombinations class to use Shewchuk's method? Or do we provide an interface and different implementations?

I think that use of an interface would be more flexible going forward. Initially there could be just a single implementation using Shewchuk's method. More methods could be added later if requested. I do see a use for the fast dot2s method (which is the current method in LinearCombinations), as it can be done with no array allocation for an effective quad precision (128-bit) summation of the dot product (a sketch is shown below). That may be enough for most use cases.
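A minimal sketch of a dot2s-style compensated dot product, where the round-off of each product and each partial sum is gathered into a single running correction with no array allocation. Note that {{Math.fma}} is used here for brevity to obtain the exact product round-off; the actual implementation uses a Dekker-style split of the multiplicands (see the sketch at the end of this message):

{code:java}
/**
 * Compensated dot product in the style of Ogita et al's dot2s.
 * Illustrative sketch only.
 */
static double dot2s(double[] x, double[] y) {
    double p = x[0] * y[0];
    // Exact round-off of the first product.
    double s = Math.fma(x[0], y[0], -p);
    for (int i = 1; i < x.length; i++) {
        final double h = x[i] * y[i];
        final double r = Math.fma(x[i], y[i], -h);
        // Two-sum of the running sum and the new product.
        final double sum = p + h;
        final double bVirtual = sum - p;
        final double q = (p - (sum - bVirtual)) + (h - bVirtual);
        p = sum;
        // A single correction accumulates both round-off terms.
        s += q + r;
    }
    return p + s;
}
{code}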
I am going to update the PR with all the latest code and merge to master. Thus the implementation will reside in the JMH examples module and can be promoted to the official arrays module if necessary.

> Improve LinearCombination accuracy during summation of the round-off errors
> ----------------------------------------------------------------------------
>
> Key: NUMBERS-142
> URL: https://issues.apache.org/jira/browse/NUMBERS-142
> Project: Commons Numbers
> Issue Type: Improvement
> Components: arrays
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Assignee: Alex Herbert
> Priority: Minor
> Attachments: array_performance.jpg, cond_no.jpg, dot.jpg, error_vs_condition_no.jpg, inline_perfomance.jpg, linear_cached.jpg, linear_inline_vs_array.jpg
> Time Spent: 40m
> Remaining Estimate: 0h
>
> The algorithm in LinearCombination is an implementation of dot2 from [Ogita et al|http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.1547] (algorithm 5.3). There is a subtle difference in that the original dot2 algorithm sums the round-off from the products and the round-off from the product summation together. The method in LinearCombination sums them separately (using an extra variable) and adds them at the end. This actually improves the accuracy under conditions where the round-off has a greater sum than the products, as described below.
>
> The dot2 algorithm suffers when the total sum is close to 0 but the intermediate sum is far enough from zero that there is a large difference between the exponents of the summed terms and the final result. In this case the sum of the round-off is more important than the sum of the products, which due to massive cancellation is zero. The case is important for Complex numbers, which require a computation of log1p(x^2+y^2-1) when x^2+y^2 is close to 1 such that log(1) would be ~zero yet the true logarithm is representable to very high precision.
>
> This can be protected against by using the dotK algorithm of Ogita et al with K>2. This saves all the round-off parts from each product and the running sum. These are subject to an error-free transform that repeatedly adds adjacent pairs to generate a new split pair with a closer upper and lower part. Over time this will order the parts from low to high and these can be summed low first for an error-free dot product.
>
> Using this algorithm with a single pass (K=3 for dot3) removes the cancellation error observed in the mentioned use case. Adding a single pass over the parts changes the algorithm from 25n floating point operations (flops) to 31n flops for the sum of n products.
>
> A second change for the algorithm is to switch to using [Dekker's|https://doi.org/10.1007/BF01397083] algorithm (Dekker, 1971) to split the number. This extracts two 26-bit mantissas from a 53-bit mantissa (in IEEE 754 the leading bit in front of the 52-bit mantissa is assumed to be 1). This is done by multiplication by 2^s+1 with s = ceil(53/2) = 27:
>
> {{big = (2^s+1) * a}}
> {{a_hi = (big - (big - a))}}
>
> The extra bit of precision is carried into the sign of the low part of the split number.
>
> This is in contrast to the method in LinearCombination that uses a simple mask on the long representation to obtain the a_hi part in 26 bits; the lower part will have 27 bits.
>
> The advantage of Dekker's method is that it produces 2 parts with 26 bits in the mantissa that can be multiplied exactly. The disadvantage is the potential for overflow, requiring a branch condition to check for extreme values. It also appropriately handles very small sub-normal numbers that, using the current method, would be masked to create a 0 high part with all the non-zero bits left in the low part. This will have knock-on effects on the split multiplication, which requires the high part to be larger.
>
> A simple change to the current implementation to use Dekker's split improves the precision on a wide range of test data (not shown).
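For reference, the split and exact two-product described in the issue text can be sketched as below. This is illustrative only and omits the branch guarding against overflow of the multiplication by 2^27+1 for extreme values:

{code:java}
/** Multiplier for Dekker's split: 2^s + 1 with s = ceil(53 / 2) = 27. */
private static final double SPLIT = 0x1.0p27 + 1;

/**
 * Dekker's split of a into a high part with at most 26 bits in the
 * mantissa; the extra bit of precision is carried in the sign of the
 * low part (a - highPart(a)). Sketch only: no overflow protection.
 */
static double highPart(double a) {
    final double big = SPLIT * a;
    return big - (big - a);
}

/** Round-off of the product xy = x * y using Dekker's two-product. */
static double productLow(double x, double y, double xy) {
    final double hx = highPart(x);
    final double lx = x - hx;
    final double hy = highPart(y);
    final double ly = y - hy;
    // hx * hy is exact as each high part has at most 26 bits; the
    // round-off is recovered from the partial products.
    return lx * ly - (((xy - hx * hy) - lx * hy) - hx * ly);
}
{code}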