Thank you for the feedback and confirmation, Eric. I am looking forward to the new statistics libraries. I am glad that the work continues on these libraries and that they will be available to the community to improve and enhance our projects.
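For reference, here is a minimal sketch of the two approaches discussed in this thread: a median via Arrays.sort(), and a median via a randomized selection (quickselect). It assumes a plain, non-empty double[] with no NaN values. The class and method names (SimpleMedian, medianBySort, medianBySelect) are hypothetical and are not part of Commons Math or commons-numbers.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, not part of any Apache Commons library. */
public final class SimpleMedian {
    private SimpleMedian() {}

    /** Median via a full sort, O(n log n). Works on a copy of the input. */
    public static double medianBySort(double[] values) {
        double[] sorted = copyNonEmpty(values);
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Median via quickselect: O(n) on average, O(n^2) in the worst case. */
    public static double medianBySelect(double[] values) {
        double[] work = copyNonEmpty(values);
        int n = work.length;
        double upper = select(work, n / 2);
        if (n % 2 == 1) {
            return upper;
        }
        // After select, every element left of index n/2 is <= work[n/2],
        // so the lower middle value is the maximum of that left part.
        double lower = work[0];
        for (int i = 1; i < n / 2; i++) {
            lower = Math.max(lower, work[i]);
        }
        return (lower + upper) / 2.0;
    }

    private static double[] copyNonEmpty(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        return Arrays.copyOf(values, values.length);
    }

    /** Rearranges a in place so that a[k] holds the k-th smallest value. */
    private static double select(double[] a, int k) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (k == p) {
                return a[k];
            }
            if (k < p) {
                hi = p - 1;
            } else {
                lo = p + 1;
            }
        }
        return a[k];
    }

    /** Lomuto partition around a randomly chosen pivot; returns its final index. */
    private static int partition(double[] a, int lo, int hi) {
        int pivotIndex = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        double pivot = a[pivotIndex];
        swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(double[] a, int i, int j) {
        double tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

Both methods copy the input so the caller's array is left unchanged. That copy is itself a cost, so an in-place variant may be preferable for very large arrays. Which of the two is faster on a given data set is something to measure rather than assume.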
On Wed, May 29, 2019 at 10:56 PM Eric Barnhill <[email protected]> wrote:

> Hi Marco,
>
> Thanks a lot for this feedback. I am one of the contribs building out the
> new commons-numbers release which will replace much of commons-math,
> including the statistics libraries.
>
> I took a look at the commons-math code for Median. Based on my reading of
> it I am not surprised by what you report. Median just invokes Percentile.
> Percentile does seem to have some overhead, for example lots of range and
> null checking. As a one-off, Median is probably not as efficient as a
> simple sort. However, it can store data and recalculate percentiles, and
> there it might be efficient to use, as well as having flexibility in how it
> is implemented.
>
> I quite agree that most users simply want to call median() on their array
> and expect at least as efficient an algorithm as they could hand code
> themselves. For medians, as we all learned in CS 101 class, for a guaranteed
> result one cannot improve on O(n log n) time, just sorting like you
> mentioned; however, a selection method can deliver O(n) time on average,
> but with a potential worst case of O(n^2), and the user could be given this
> choice of implementation.
>
> The fact that the commons-math contributors did not seem aware of this
> finding shows some of the occasional weaknesses I am finding as we design a
> new, fleeter library in commons-numbers. We are redesigning the statistics
> libraries over this summer with support from Google Summer of Code, so I
> will see what I can do. Without feedback like yours we would not find these
> points of improvement, so thanks.
>
> On Wed, May 29, 2019 at 3:24 AM Marco Neumann <[email protected]>
> wrote:
>
> > I am evaluating the use of Apache Commons Math Median for the querying of
> > large data sets in another Apache project called Apache Jena.
> >
> > In my preliminary performance tests I was surprised to find that a simple
> > implementation of a median function with Arrays.sort() and a programmatic
> > selection of the median value yields much faster results
> > than Median().evaluate() or DescriptiveStatistics.getPercentile(50).
> >
> > Since we only use this function for arrays of confirmed numbers, is there
> > a particular benefit in using Apache Commons Math for this task, or are we
> > better advised to use our own implementation here?
> >
> > Thank you
> >
>

--

---
Marco Neumann
KONA
