[ 
https://issues.apache.org/jira/browse/STATISTICS-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707525#comment-17707525
 ] 

Anirudh Joshi commented on STATISTICS-54:
-----------------------------------------

Hello [~aherbert] and [~erans]. Hope you are doing well. My name is Anirudh and 
I am interested in contributing to this project as part of GSoC 2023. I am 
working on my proposal and would like to discuss my ideas with the community 
before I finalize it, to check that I am thinking in the right direction.

I have been familiarizing myself with the commons-stat/stat/descriptive project 
over the past few days. I saw that the current implementation of 
SummaryStatistics works only with a sequential stream of values, since the 
combiner parameter of Stream::collect is never invoked in that case. Our goal 
is to add support for parallel streams too, since that would help us reduce 
the time needed to compute summary statistics, especially when the dataset 
is large.
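
To illustrate the point about the combiner, here is a minimal sketch that counts how often the combiner passed to collect is called. The class name CombinerDemo and the helper combinerCalls are made up for this example and are not part of Commons Math. As far as I understand, the "never invoked" behaviour for sequential streams is how current OpenJDK implementations behave; the number of calls for a parallel stream depends on how the stream is split.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.DoubleStream;

// Hypothetical demo class, not part of Commons Math.
final class CombinerDemo {

    /** Sums 1..n with a three-argument collect and returns the number of combiner calls. */
    static int combinerCalls(int n, boolean parallel) {
        AtomicInteger calls = new AtomicInteger();
        DoubleStream values = DoubleStream.iterate(1.0, x -> x + 1.0).limit(n);
        if (parallel) {
            values = values.parallel();
        }
        double[] sumAndCount = values.collect(
            () -> new double[2],                          // supplier: {sum, count}
            (acc, x) -> { acc[0] += x; acc[1] += 1; },    // accumulator
            (left, right) -> {                            // combiner: merges two partial results
                calls.incrementAndGet();
                left[0] += right[0];
                left[1] += right[1];
            });
        // Sanity check: the element count must be n regardless of how the stream was split.
        if ((long) sumAndCount[1] != n) {
            throw new IllegalStateException("lost elements while combining");
        }
        return calls.get();
    }
}
```

Calling combinerCalls(1000, false) should return 0, while combinerCalls(1000, true) will typically return a positive number, which is why a correct merge is needed before parallel streams can be supported.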

An important ingredient that we need to support parallel streams is `merge` 
functionality: the ability to merge two partially constructed 
`StorelessUnivariateStatistic` objects. Once this is implemented for all 
implementing classes of StorelessUnivariateStatistic, we would be able to 
compute partial SummaryStatistics objects and use the merge function to 
aggregate them into a single SummaryStatistics object that gives the 
statistics for the entire dataset.
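
As a rough sketch of what merging two partial results could look like for the simplest case, the class below keeps a count and a running mean and combines two instances using the weighted-mean update. PartialMean is a stand-in invented for this example; it is not the Commons Math Mean class and the real implementation may need to track more state.

```java
// Hypothetical stand-in for a mergeable mean; not the Commons Math Mean class.
final class PartialMean {
    private long n;
    private double mean;

    /** Adds one value using the incremental mean update. */
    void add(double x) {
        n++;
        mean += (x - mean) / n;
    }

    /** Merges another partial mean into this one, weighting by the number of values. */
    void merge(PartialMean other) {
        if (other.n == 0) {
            return; // nothing to merge
        }
        long total = n + other.n;
        mean += (other.mean - mean) * other.n / total;
        n = total;
    }

    long getN() {
        return n;
    }

    /** Returns the mean of all values seen so far, or NaN if no values were added. */
    double getResult() {
        return n == 0 ? Double.NaN : mean;
    }
}
```

For example, adding 1.0, 2.0, 3.0 to one instance and 4.0, -1.0 to another, then merging the second into the first, should give the same mean (1.8) as adding all five values to a single instance.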

My idea is to define a generic interface as follows:
{code:java}
public interface StatisticAccumulator<T extends StorelessUnivariateStatistic> {

    // Add a single value to the accumulator
    void add(double d);
    
    // Merge another accumulator; the bound ensures that the parameter is an
    // accumulator implementation for the same statistic type T
    <U extends StatisticAccumulator<T>> void merge(U other);

    // Merge two partially constructed StorelessUnivariateStatistic objects 
    void merge(T other);

    // Get the statistic we are trying to accumulate
    T get();

} {code}
with implementations for the various statistics we have, such as 
MeanAccumulator, GeometricMeanAccumulator, VarianceAccumulator, etc.
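
For a statistic like variance the merge step is less obvious than for the mean. One well-known option is the pairwise update for the sum of squared deviations (often attributed to Chan, Golub and LeVeque). The sketch below is only my assumption of how a VarianceAccumulator could merge; PartialVariance is a made-up name and this is not how the Commons Math Variance class is currently written.

```java
// Hypothetical sketch of a mergeable variance; not the Commons Math Variance class.
final class PartialVariance {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Adds one value using Welford's incremental update. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Merges another partial result using the pairwise update of mean and m2. */
    void merge(PartialVariance other) {
        if (other.n == 0) {
            return; // nothing to merge
        }
        if (n == 0) {
            // This side is empty: copy the other side's state.
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        mean += delta * other.n / total;
        n = total;
    }

    /** Returns the bias-corrected sample variance, or NaN for fewer than two values. */
    double getResult() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }
}
```

With the values 1.0, 2.0, 3.0 in one instance and 4.0, -1.0 in another, merging should give a sample variance of 3.7, the same as processing all five values sequentially.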

A sample usage (assuming we have an implementation for MeanAccumulator) would 
look like
{code:java}
List<Double> data = Arrays.asList(1.0, 2.0, 3.0, 4.0, -1.0);
Mean mean = data.parallelStream()
        .collect(MeanAccumulator::new,
                 MeanAccumulator::add,
                 MeanAccumulator::merge)
        .get(); {code}
I have a [proof of concept 
PR|https://github.com/apache/commons-math/compare/master...ani5rudh:commons-math:STATISTICS-54-Proof-Of-Concept]
 for my approach with implementation for MeanAccumulator.

I am still a student learning the principles of object-oriented design and 
modelling, so my approach may not be perfect. I would like to know your 
thoughts on it so that I can fix and improve my design. Your feedback is 
very valuable for my learning and for developing my skills.

I also wanted to know whether the scope of the project, as far as GSoC is 
concerned, is to add stream support along with unit tests for all the 
subclasses of `AbstractStorelessUnivariateStatistic` (around 17 of them), or 
only for a subset of these. I am asking to get clarity on the goals so that I 
can plan accordingly to achieve them in the 12 weeks of the GSoC coding period. 
Please let me know. Thanks in advance!

> [GSoC] Summary statistics API for Java 8 streams
> ------------------------------------------------
>
>                 Key: STATISTICS-54
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-54
>             Project: Commons Statistics
>          Issue Type: Wish
>          Components: descriptive
>            Reporter: Alex Herbert
>            Priority: Minor
>              Labels: full-time, gsoc, gsoc2022, gsoc2023
>             Fix For: 1.0
>
>
> Placeholder for tasks that could be undertaken in this year's 
> [GSoC|https://summerofcode.withgoogle.com/].
> Ideas:
> - Design an updated summary statistics API for use with Java 8 streams based 
> on the summary statistic implementations in the Commons Math 
> {{stat.descriptive}} package including {{moments}}, {{rank}} and 
> {{summary}} sub-packages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
