Thank you for the feedback and confirmation, Eric. I am looking forward to
the new statistics libraries. I am glad that the work continues on
these libraries and that they will be available to the community to improve
and enhance our projects.



On Wed, May 29, 2019 at 10:56 PM Eric Barnhill <[email protected]>
wrote:

> Hi Marco,
>
> Thanks a lot for this feedback. I am one of the contribs building out the
> new commons-numbers release which will replace much of commons-math
> including the statistics libraries.
>
> I took a look at the commons-math code for Median. Based on my reading of
> it I am not surprised by what you report. Median just invokes Percentile.
> Percentile does seem to have some overhead, for example lots of range and
> null checking. As a one-off, Median is probably not as efficient as a
> simple sort. However, it can store data and recalculate percentiles, and
> there it might be efficient to use, as well as having flexibility in how it
> is implemented.
>
> I quite agree that most users simply want to call median() on their array
> and expect at least as efficient an algorithm as they could hand code
> themselves. For medians, as we all learned in CS 101, the simple way to
> get a guaranteed bound is O(n log n) time, just sorting like you
> mentioned. However, a selection method such as quickselect can deliver
> O(n) time on average, but with a potential worst case of O(n^2), and the
> user could be given this choice of implementation.
>
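
The trade-off described above can be sketched as follows. This is only an illustration, not code from commons-math or commons-numbers: the class name QuickselectMedian and its methods are hypothetical, the pivot is chosen at random, and even-length input is assumed to average the two middle values. Quickselect runs in expected O(n) time but can degrade to O(n^2) on unlucky pivots or heavily duplicated data.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: median via quickselect (expected O(n), worst case O(n^2)).
final class QuickselectMedian {

    /** Returns the median of values without modifying the caller's array. */
    public static double median(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        double[] a = values.clone();
        int n = a.length;
        double upper = select(a, n / 2);
        if (n % 2 == 1) {
            return upper;
        }
        // After select, every element left of index n/2 is <= a[n/2],
        // so the lower middle value is the maximum of that prefix.
        double lower = a[0];
        for (int i = 1; i < n / 2; i++) {
            lower = Math.max(lower, a[i]);
        }
        return (lower + upper) / 2.0;
    }

    /** Returns the k-th smallest element (0-based); reorders a in place. */
    static double select(double[] a, int k) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
            int pivotIndex = ThreadLocalRandom.current().nextInt(lo, hi + 1);
            int p = partition(a, lo, hi, pivotIndex);
            if (k == p) {
                return a[k];
            } else if (k < p) {
                hi = p - 1;
            } else {
                lo = p + 1;
            }
        }
        return a[lo];
    }

    // Lomuto partition: moves the pivot to its final sorted position and returns it.
    private static int partition(double[] a, int lo, int hi, int pivotIndex) {
        double pivot = a[pivotIndex];
        swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

Whether this beats a plain sort in practice depends on the data and the JVM, so it would need to be measured rather than assumed.
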
> The fact that the commons-math contributors did not seem aware of this
> finding shows some of the occasional weaknesses I am finding as we design a
> new, fleeter library in commons-numbers. We are redesigning the statistics
> libraries over this summer with support from Google Summer of Code, so I
> will see what I can do. Without feedback like yours we would not find these
> points of improvement, so thanks.
>
> On Wed, May 29, 2019 at 3:24 AM Marco Neumann <[email protected]>
> wrote:
>
> > I am evaluating the use of Apache Commons Math Median for the querying of
> > large data sets in another Apache project called Apache Jena.
> >
> > In my preliminary performance tests I was surprised to find that a simple
> > implementation of a median function with Arrays.sort() and a programmatic
> > selection of the median value yields much faster results
> > than Median().evaluate() or DescriptiveStatistics.getPercentile(50).
> >
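
For reference, a sort-based median of the kind described above might look like the sketch below. The original message does not include its code, so this is an assumption about its shape: the class name SortMedian is hypothetical, the input is assumed to contain no NaN values, and even-length input is assumed to average the two middle values.

```java
import java.util.Arrays;

// Hypothetical sketch: median by sorting a copy of the input, O(n log n).
final class SortMedian {

    /** Returns the median of values without modifying the caller's array. */
    public static double median(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        double[] sorted = values.clone();   // keep the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]                               // odd: the middle element
                : (sorted[mid - 1] + sorted[mid]) / 2.0;    // even: mean of the two middle elements
    }

    public static void main(String[] args) {
        System.out.println(median(new double[] {5, 1, 4, 2, 3})); // prints 3.0
        System.out.println(median(new double[] {4, 1, 3, 2}));    // prints 2.5
    }
}
```
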
> > Since we only use this function for arrays of confirmed numbers, is there a
> > particular benefit in using Apache Commons Math for this task or are we
> > better advised to use our own implementation here?
> >
> > Thank You
> >
>


-- 


---
Marco Neumann
KONA
