[jira] [Commented] (STATISTICS-89) Benchmark statistic generation time using array input

Alex Herbert (Jira) Fri, 14 Mar 2025 08:08:24 -0700


    [ 
https://issues.apache.org/jira/browse/STATISTICS-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935547#comment-17935547
 ]


Alex Herbert commented on STATISTICS-89:
----------------------------------------

h2. Statistic Performance

I updated the benchmark with all the implementations for DoubleStatistic. Here 
is a plot of the performance against the size of the array:

!array.png|width=1200,height=800!

There is some overhead at small lengths but otherwise the creation speed is 
approximately linear with length. The slowest statistics are the sum of logs 
and the geometric mean which uses sum of logs. Then the moment statistics in 
descending order of complexity: kurtosis, skewness; variance; mean.

The product is a strange outlier as it is faster than the sum at small lengths 
but slower at large lengths. It uses only 1 float-operation (FLOP) per input 
value so the slowness at large lengths compared to the sum which uses multiple 
FLOPS is strange. Note that the random input uses a uniform distribution around 
zero for logarithm (so the sum of logs is zero). Thus the product should be 
centred around 1 and not under/overflow.
h2. Sub Ranges

As a simple test I have added a benchmark that copies half of the input array 
using Arrays.copyOfRange and then computes the statistic on this smaller 
subset. This is the current method to support creation of a statistic from part 
of an array while still using the array creation API. An alternative is to 
stream only part of the array and populate the statistic using the accept 
method:
{code:java}
int from = ...
int to = ...
double[] data = ...

Mean s = Mean.of(Arrays.copyOfRange(data, from, to));
// Or
Mean s = Mean.create();
Arrays.stream(data, from, to).forEach(s::accept);
{code}
Note however that some statistics are more accurate when operating on an array 
as a multiple pass algorithm can be used. Thus it is beneficial for accuracy to 
copy the range at the cost of memory usage.

If the sub-range is half of the input array then the benchmark with the full 
range and half range can be performed using lengths that are powers of 2. The 
half range lengths can then be compared to the equivalent full length range on 
the previous array length. This shows that performing an array copy is a 
noticeable relative overhead for fast statistics, but less so for slow 
statistics. Here is a plot of min, mean, and sum of logs:

!array_vs_range.png|width=1200,height=800!

The copy of the array adds a fixed time and memory overhead to the cost. Here 
is a table for length 16384:
||Statistic||Array||ArrayRange||Relative||
|MAX|8868.8|12227.8|1.379|
|MIN|8880.4|12263.8|1.381|
|SUM_OF_SQUARES|13289.0|16705.3|1.257|
|MEAN|14793.1|18583.1|1.256|
|SUM|14665.9|19235.3|1.312|
|PRODUCT|17709.6|21138.0|1.194|
|VARIANCE|28379.2|31789.7|1.120|
|STANDARD_DEVIATION|28421.8|31986.0|1.125|
|SKEWNESS|42220.4|45217.5|1.071|
|KURTOSIS|55291.7|59348.3|1.073|
|GEOMETRIC_MEAN|164425.8|170517.3|1.037|
|SUM_OF_LOGS|161108.3|175346.1|1.088|

For the slowest statistic the performance gain of avoiding an array copy is 
around 5%. This may change depending on the amount of data to copy. For faster 
statistics it would be beneficial to offer an array subrange API. For any user 
that is memory limited it would be beneficial to offer a subrange API to avoid 
array copies.

No results are available for computation of multiple statistics. However the 
summary statistics aggregators simply call each statistic instance with the 
same array. Thus the total runtime would be expected to be the runtime of the 
ArrayRange result for the first statistic to include the sub-range copy and 
then the runtime of the Array result for the remaining statistics. For example:
||Statistic||Array||ArrayRange||Relative||
|MAX, MIN, MEAN|32542.3|35901.3|1.103|
|MAX, MIN, VARIANCE|46128.3|49487.3|1.073|
h2. Sub Range API

Here are some options for a range API. Note that the current API uses a varargs 
signature. To avoid method mismatch a range would have to be used with an array 
argument and the [from, to) range, e.g.:
{code:java}
// Current
public static Min of(double... values)

// Range API options
public static Min of(double[] values, int from, int to)
public static Min ofRange(double[] values, int from, int to)

// Possible clash using varargs:
// Min.of(0, 2) 
// Min.of(0, 2, 1.0, 2.0, 3.0, 4.0)
public static Min of(int from, int to, double... values)
{code}

> Benchmark statistic generation time using array input
> -----------------------------------------------------
>
>                 Key: STATISTICS-89
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-89
>             Project: Commons Statistics
>          Issue Type: Task
>          Components: descriptive
>            Reporter: Alex Herbert
>            Priority: Minor
>         Attachments: array.png, array_vs_range.png
>
>
> Update the JMH benchmark for creation of a statistic from an array to support 
> the current implementations. (Note: At present the benchmark uses a simple 
> sum implementation to test different loop/stream methods.)
> Investigate the performance of creation of statistics from a sub-range of an 
> array. Currently this is not possible and requires a copy of the sub-range, 
> e.g. to get the mean of the last half of an array:
> {code:java}
> double[] x = ...
> int from = x.length >> 1;
> int to = x.length;
> Mean m = Mean.of(Arrays.copyOfRange(x, from, to))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (STATISTICS-89) Benchmark statistic generation time using array input

Reply via email to