[
https://issues.apache.org/jira/browse/STATISTICS-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935547#comment-17935547
]
Alex Herbert commented on STATISTICS-89:
----------------------------------------
h2. Statistic Performance
I updated the benchmark with all the implementations for DoubleStatistic. Here
is a plot of the performance against the size of the array:
!array.png|width=1200,height=800!
There is some overhead at small lengths but otherwise the creation speed is
approximately linear with length. The slowest statistics are the sum of logs
and the geometric mean which uses sum of logs. Then the moment statistics in
descending order of complexity: kurtosis, skewness; variance; mean.
The product is a strange outlier as it is faster than the sum at small lengths
but slower at large lengths. It uses only 1 float-operation (FLOP) per input
value so the slowness at large lengths compared to the sum which uses multiple
FLOPS is strange. Note that the random input uses a uniform distribution around
zero for logarithm (so the sum of logs is zero). Thus the product should be
centred around 1 and not under/overflow.
h2. Sub Ranges
As a simple test I have added a benchmark that copies half of the input array
using Arrays.copyOfRange and then computes the statistic on this smaller
subset. This is the current method to support creation of a statistic from part
of an array while still using the array creation API. An alternative is to
stream only part of the array and populate the statistic using the accept
method:
{code:java}
int from = ...
int to = ...
double[] data = ...
Mean s = Mean.of(Arrays.copyOfRange(data, from, to));
// Or
Mean s = Mean.create();
Arrays.stream(data, from, to).forEach(s::accept);
{code}
Note however that some statistics are more accurate when operating on an array
as a multiple pass algorithm can be used. Thus it is beneficial for accuracy to
copy the range at the cost of memory usage.
If the sub-range is half of the input array then the benchmark with the full
range and half range can be performed using lengths that are powers of 2. The
half range lengths can then be compared to the equivalent full length range on
the previous array length. This shows that performing an array copy is a
noticeable relative overhead for fast statistics, but less so for slow
statistics. Here is a plot of min, mean, and sum of logs:
!array_vs_range.png|width=1200,height=800!
The copy of the array adds a fixed time and memory overhead to the cost. Here
is a table for length 16384:
||Statistic||Array||ArrayRange||Relative||
|MAX|8868.8|12227.8|1.379|
|MIN|8880.4|12263.8|1.381|
|SUM_OF_SQUARES|13289.0|16705.3|1.257|
|MEAN|14793.1|18583.1|1.256|
|SUM|14665.9|19235.3|1.312|
|PRODUCT|17709.6|21138.0|1.194|
|VARIANCE|28379.2|31789.7|1.120|
|STANDARD_DEVIATION|28421.8|31986.0|1.125|
|SKEWNESS|42220.4|45217.5|1.071|
|KURTOSIS|55291.7|59348.3|1.073|
|GEOMETRIC_MEAN|164425.8|170517.3|1.037|
|SUM_OF_LOGS|161108.3|175346.1|1.088|
For the slowest statistic the performance gain of avoiding an array copy is
around 5%. This may change depending on the amount of data to copy. For faster
statistics it would be beneficial to offer an array subrange API. For any user
that is memory limited it would be beneficial to offer a subrange API to avoid
array copies.
No results are available for computation of multiple statistics. However the
summary statistics aggregators simply call each statistic instance with the
same array. Thus the total runtime would be expected to be the runtime of the
ArrayRange result for the first statistic to include the sub-range copy and
then the runtime of the Array result for the remaining statistics. For example:
||Statistic||Array||ArrayRange||Relative||
|MAX, MIN, MEAN|32542.3|35901.3|1.103|
|MAX, MIN, VARIANCE|46128.3|49487.3|1.073|
h2. Sub Range API
Here are some options for a range API. Note that the current API uses a varargs
signature. To avoid method mismatch a range would have to be used with an array
argument and the [from, to) range, e.g.:
{code:java}
// Current
public static Min of(double... values)
// Range API options
public static Min of(double[] values, int from, int to)
public static Min ofRange(double[] values, int from, int to)
// Possible clash using varargs:
// Min.of(0, 2)
// Min.of(0, 2, 1.0, 2.0, 3.0, 4.0)
public static Min of(int from, int to, double... values)
{code}
> Benchmark statistic generation time using array input
> -----------------------------------------------------
>
> Key: STATISTICS-89
> URL: https://issues.apache.org/jira/browse/STATISTICS-89
> Project: Commons Statistics
> Issue Type: Task
> Components: descriptive
> Reporter: Alex Herbert
> Priority: Minor
> Attachments: array.png, array_vs_range.png
>
>
> Update the JMH benchmark for creation of a statistic from an array to support
> the current implementations. (Note: At present the benchmark uses a simple
> sum implementation to test different loop/stream methods.)
> Investigate the performance of creation of statistics from a sub-range of an
> array. Currently this is not possible and requires a copy of the sub-range,
> e.g. to get the mean of the last half of an array:
> {code:java}
> double[] x = ...
> int from = x.length >> 1;
> int to = x.length;
> Mean m = Mean.of(Arrays.copyOfRange(x, from, to))
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)