[ 
https://issues.apache.org/jira/browse/NUMBERS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074927#comment-17074927
 ] 

Alex Herbert commented on NUMBERS-142:
--------------------------------------

I have investigated two outstanding items in the LinearCombinations class:
 # Impact of different summations for the extended method
 # Use of a cached array

Data has been collected as the median of 5 JMH runs using JDK 11.

Note: The data for the dot products used a condition number of 1e20. This is 
valid for lengths of 6 or above; shorter lengths use the same data truncated. 
The data is random, so the truncated dot products are unlikely to be 
ill-conditioned. This does not matter when the data is used to test branchless 
inlined versions. The lack of ill-conditioned data may make the array-based 
extended algorithm appear faster, as it performs zero elimination and so 
reduces the length of the expanded floating-point representation of the sum. 
However, this does represent the common real-world use case where the dot 
product is not ill-conditioned.
h2. Impact of different summations for the extended method

I tested:
||Method||Description||
|dot2s|Ogita's dot2s algorithm (lower baseline)|
|extended|The extended method of Shewchuk. Summation sums the values in the expansion. The sum is accurate to 1 ULP.|
|extended2|The extended method of Shewchuk. Summation uses a running two-sum algorithm over the expansion for 2-fold precision (see the sketch after this table). The initial sum is accurate to 1 ULP; this is then added to the sum of the round-offs, which is also accurate to 1 ULP. In every case I generated this is exact, although there may be some extremely rare cases where it fails to be.|
|extended_exact|The extended method of Shewchuk. Summation uses a sticky sum algorithm over the expansion. This carries the round-off lost at each sum up to the next number as a single sticky bit. This is done for all but the final sum (to avoid including a single-bit, or 1 ULP, error in the result). The result is exact. The algorithm requires certain conditions to be met in the summation, which take time to ensure.|
|exact|Computation using BigDecimal (upper baseline)|
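
As a reference for the extended2 row above, here is a minimal sketch of a running two-sum over an expansion (illustrative only; the method and variable names are not from the actual implementation):
{code:java}
/**
 * Sum an expansion e of length n using a running two-sum.
 * The round-off of every addition is recovered exactly (Knuth's two-sum)
 * and accumulated, then added to the sum at the end for 2-fold precision.
 */
static double sumExpansion(double[] e, int n) {
    double sum = 0;
    double roundOff = 0;
    for (int i = 0; i < n; i++) {
        final double s = sum + e[i];
        final double b = s - sum;
        // Exact round-off of (sum + e[i]); branch free.
        roundOff += (sum - (s - b)) + (e[i] - b);
        sum = s;
    }
    return sum + roundOff;
}
{code}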

There is data for the inline methods of length 2, 3, and 4 and then the array 
method (marked with an 'A' suffix).
||Length||dot2s||extended||extended2||extended_exact||exact||dot2s A||extended A||extended2 A||extended_exact A||exact A||extended2 A / dot2s A||
|2|306940.4537|361433.2|389010.3|580662.6|44250788.48|344090.5839|363567.5|420009.3|648060.8|46977505.64|1.22|
|3|364950.9996|578679.6|667644.4|859407.3|54986485.63|410494.1042|548362.0|636538.5|988465.9|57459555.72|1.55|
|4|471956.165|853919.3|976081.7|1189582.2|66924264.38|489863.8208|1236478.0|1228542.2|1421218.7|62386967.71|2.51|
|8| | | | | |704726.5403|3225953.8|3556960.0|3174642.9|87008595.67|5.05|
|16| | | | | |935433.6645|6776523.9|6892102.2|6873813.5|143016906.5|7.37|
|32| | | | | |1727660.655|13942893.3|14003726.8|14099622.1|255850004.5|8.11|
|64| | | | | |3486007.003|29031093.3|29523472.5|29618878.5|485887352.7|8.47|

!linear_inline_vs_array.jpg!
h3. Observations

The inline methods do not actually provide much performance improvement over 
passing the input data as arrays. The only noticeable difference is at length 
4 for the extended method, where the array version is about 50% slower. The 
lack of difference is because the extended method uses inlined two-products 
that are then summed (see the sketch below). The array and inline methods for 
length 2 are therefore identical; length 3 just adds a single extra product to 
the sum, and length 4 computes 2 inlined two-products and sums them. The array 
method uses zero elimination and so has a lot of conditional if statements. 
The inlined method does not use zero elimination: zeros have no effect on the 
sum. Avoiding zeros is useful to reduce the length of the growing expansion, 
but this is only relevant when the number of products is high. For short 
lengths, including the zeros is faster than testing for and excluding them.
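
For context, a two-product recovers the round-off of a double multiplication so the exact product can be carried in two parts. A minimal sketch using Dekker's split (the same split discussed in the issue description below; names are illustrative):
{code:java}
/** Dekker's split: the high 26 bits of the 53-bit significand of a. */
static double highPart(double a) {
    final double c = 1.34217729E8 * a; // multiply by 2^27 + 1
    return c - (c - a);
}

/**
 * Round-off of the product a * b given the rounded product p = a * b,
 * so that the exact product is p + productLow(a, b, p).
 * Note: this sketch has no guard against overflow of the scaled split
 * for extreme values.
 */
static double productLow(double a, double b, double p) {
    final double ah = highPart(a);
    final double al = a - ah;
    final double bh = highPart(b);
    final double bl = b - bh;
    return al * bl - (((p - ah * bh) - al * bh) - ah * bl);
}
{code}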

The extended_exact method is slower than the extended2 method by a noticeable 
amount when the length is short (2, 3, 4). At larger lengths (>= 8) the final 
summation is a small part of the run-time and there is no difference. However, 
given that I cannot create an example where the extended2 method does not 
compute the exact result, the use of the sticky summation seems to be 
overkill. The simple running two-sum is easy to implement and branch free. 
This makes it fast enough at small lengths that the extra overhead versus a 
simple sum is less than 10%. The result is almost ensured to be exact, and 
will be within 1 ULP.

Using the extended2 method versus the fast dot2s will have a performance 
impact that depends on the condition number and the length of the dot product. 
For short lengths (2 to 4) the method is about 1.2 to 2.5x slower; long dot 
products are much slower (8x). Either way the runtime is insignificant 
compared to using BigDecimal, which is an order of magnitude slower again.
h2. Use of a cached array

Starting from the dot3 method, the summation requires allocating an array to 
hold intermediate values. I created a variant method that accepts an 
{{IntFunction<double[]>}} in the constructor; this is used to create the 
{{double[]}}. The implementation used in the benchmark holds a reference to a 
cached array: it creates a new array if the required size is bigger than the 
cached array, otherwise it returns the previous array.
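
A minimal sketch of such a caching factory (illustrative only, not the benchmark code; note that this simple form is not thread-safe):
{code:java}
import java.util.function.IntFunction;

/** Reuses the last allocated array whenever it is large enough. */
class CachedArrayFactory implements IntFunction<double[]> {
    private double[] cache = new double[0];

    @Override
    public double[] apply(int size) {
        double[] a = cache;
        if (a.length < size) {
            // Required size exceeds the cached array: allocate and keep it.
            a = new double[size];
            cache = a;
        }
        return a;
    }
}
{code}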

I tested the speed on the fastest and slowest methods that require an array:
||Length||dot3||dot3 cached||extended||extended cached||
|2|357740.793|454667.745|363567.489|351322.843|
|3|452966.951|533987.65|548361.956|546975.698|
|4|561549.896|608230.381|1236478|1215191.01|
|8|898350.916|1252854.4|3225953.81|3123528.03|
|16|1869122.7|1671168.34|6776523.91|6784977.07|
|32|3666645.44|3452745.79|13942893.3|14103241.6|
|64|7469366.79|7014372.78|29031093.3|29780801.2|

!linear_cached.jpg!
h3. Observations

There is no difference for the slow method (extended). For the fast method 
(dot3), using the cache for small array sizes is actually slower. The JVM can 
allocate small objects in a thread-local region without using global memory, 
which would require some level of synchronisation; this is known as the Thread 
Local Allocation Buffer (TLAB). When the length is large enough that the TLAB 
cannot be used, the method is already so slow that the allocation of the 
buffer is insignificant. So a variant that recycles a buffer is not required.

 
h2. Discussion

Do we change the current LinearCombinations class to use Shewchuk's method?

Or do we provide an interface and different implementations?

I think that use of an interface would be more flexible going forward. 
Initially there could be just a single implementation using Shewchuk's method. 
More methods could be added later if requested. I do see a use for the fast 
dot2s method (the current method in LinearCombinations) as it requires no 
array allocation and provides an effective quad-precision (128-bit) summation 
of the dot product. That may be enough for most use cases.
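
A possible shape for such an interface (purely a sketch; the name and method signatures are illustrative, not a proposal for the final API):
{code:java}
/** Illustrative sketch: computes the sum of products a_i * b_i. */
public interface LinearCombination {
    /** Inline form for two products: a1 * b1 + a2 * b2. */
    double value(double a1, double b1, double a2, double b2);

    /** Array form for n products. */
    double value(double[] a, double[] b);
}
{code}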

I am going to update the PR with all the latest code and merge to master. Thus 
the implementation will reside in the JMH examples module and can be promoted 
to the official arrays module if necessary.




> Improve LinearCombination accuracy during summation of the round-off errors
> ---------------------------------------------------------------------------
>
>                 Key: NUMBERS-142
>                 URL: https://issues.apache.org/jira/browse/NUMBERS-142
>             Project: Commons Numbers
>          Issue Type: Improvement
>          Components: arrays
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Minor
>         Attachments: array_performance.jpg, cond_no.jpg, dot.jpg, 
> error_vs_condition_no.jpg, inline_perfomance.jpg, linear_cached.jpg, 
> linear_inline_vs_array.jpg
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> The algorithm in LinearCombination is an implementation of dot2 from [Ogita 
> et al|http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.1547] 
> (algorithm 5.3). There is a subtle difference in that the original dot2 
> algorithm sums the round-off from the products and the round-off from the 
> product summation together. The method in LinearCombination sums them 
> separately (using an extra variable) and adds them at the end. This actually 
> improves the accuracy under conditions where the round-off is of greater sum 
> than the products, as described below.
> The dot2 algorithm suffers when the total sum is close to 0 but the 
> intermediate sum is far enough from zero that there is a large difference 
> between the exponents of summed terms and the final result. In this case the 
> sum of the round-off is more important than the sum of the products which due 
> to massive cancellation is zero. The case is important for Complex numbers 
> which require a computation of log1p(x^2+y^2-1) when x^2+y^2 is close to 1 
> such that log(1) would be ~zero yet the true logarithm is representable to 
> very high precision.
> This can be protected against by using the dotK algorithm of Ogita et al with 
> K>2. This saves all the round-off parts from each product and the running 
> sum. These are subject to an error free transform that repeatedly adds 
> adjacent pairs to generate a new split pair with a closer upper and lower 
> part. Over time this will order the parts from low to high and these can be 
> summed low first for an error free dot product.
> Using this algorithm with a single pass (K=3 for dot3) removes the 
> cancellation error observed in the mentioned use case. Adding a single pass 
> over the parts changes the algorithm from 25n floating point operations 
> (flops) to 31n flops for the sum of n products.
> A second change for the algorithm is to switch to using 
> [Dekker's|https://doi.org/10.1007/BF01397083] algorithm (Dekker, 1971) to 
> split the number. This extracts two 26-bit mantissas from a 53-bit mantissa 
> (in IEEE 754 the leading bit in front of the 52-bit mantissa is assumed to 
> be 1). This is done by multiplication by 2^s+1 with s = ceil(53/2) = 27:
> big = (2^s+1) * a
> a_hi = (big - (big - a))
> The extra bit of precision is carried into the sign of the low part of the 
> split number.
> This is in contrast to the method in LinearCombination that uses a simple 
> mask on the long representation to obtain the a_hi part in 26-bits and the 
> lower part will be 27 bits.
> The advantage of Dekker's method is it produces 2 parts with 26 bits in the 
> mantissa that can be multiplied exactly. The disadvantage is the potential 
> for overflow requiring a branch condition to check for extreme values.
> It also appropriately handles very small sub-normal numbers that would be 
> masked to create a 0 high part with all the non-zero bits left in the low 
> part using the current method. This will have knock-on effects on split 
> multiplication which requires the high part to be larger.
> A simple change to the current implementation to use Dekker's split improves 
> the precision on a wide range of test data (not shown).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
