Re: [math] Recent commits to stat, util packages

Mark R. Diggory Sat, 05 Jul 2003 21:44:16 -0700

Phil Steitz wrote:

First, the testAddElementRolling test case in FixedDoubleArrayTest will not
compile, since it is trying to access what is now a private field in
FixedDoubleArray (internalArray). The changes to FixedDoubleArray should be
rolled back or the tests should be modified so that they compile and succeed.

Thanks for pointing this out. This was a minor problem that was easily fixed by using the appropriate getValues method.

Second, I do not see the value in all of the additional classes and overhead introduced into stat. The goal of Univariate was to provide basic univariate statistics via a simple interface and lightweight, numerically sound implementation, consistent with the vision of commons-math and Jakarta Commons in general.

I am unclear how the work I have done is any less "lightweight" than anything else in the commons project (or the math project, for that matter). Surely organizing your classes into packages does not constitute "heaviness".

I fear that we may be straying off into statistical computation
framework-building, which I don't think belongs in commons-math (really Jakarta
Commons).

On the contrary, the "design" of the math package is what's in question here. This is about alternate opinions coming together to establish what an optimal design can be, not about someone in this group having to go off and write their own library because their vision is not what another individual in the project likes. This is not a "one man" project.

More importantly, I don't think we need to add this complexity to
deliver the functionality that we are providing. The only problem that I see
with the structure prior to the recent commits is the confusion between
collections and univariates addValue methods.  I would favor eliminating the
List and BeanList univariates altogether and replacing their functionality with
methods added to StatUtils that take Lists or Collections and property names as
input and compute statistics from them. Similarly, the Univariate interface
could be modified to include addValues(double[]), addValues(List) (assumes
contents are Numbers), addValues(Collection, propertyName).

IMHO, this just creates an even more "monolithic" StatUtils class. Every time we decide to work with another Collection or object type, are we going to have to implement another set of duplicate delegation methods in StatUtils? This doesn't seem very beneficial to me. I've already gone forward and implemented a new "MixedListUnivariate" implementation that works with heterogeneous objects (which can be mapped to double primitives with "NumberTransformer" objects); my next step was to commit these to the project.
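To illustrate the transformer idea, here is a minimal sketch. The NumberTransformer signature, the DefaultTransformer class and the MixedListMean helper are assumptions made up for this example; they are not the code I intend to commit, and the sketch uses generics for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical signature: maps an arbitrary object to a double primitive.
interface NumberTransformer {
    double transform(Object o);
}

// Illustrative default that accepts Numbers and numeric Strings.
final class DefaultTransformer implements NumberTransformer {
    @Override
    public double transform(Object o) {
        if (o instanceof Number) {
            return ((Number) o).doubleValue();
        }
        if (o instanceof String) {
            return Double.parseDouble((String) o);
        }
        throw new IllegalArgumentException("Cannot convert to double: " + o);
    }
}

// Hypothetical helper: one statistic over a heterogeneous list.
final class MixedListMean {
    private MixedListMean() {}

    // Transforms each element to a double, then averages; NaN for an empty list.
    static double mean(List<?> values, NumberTransformer transformer) {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (Object o : values) {
            sum += transformer.transform(o);
        }
        return sum / values.size();
    }
}

class MixedListDemo {
    public static void main(String[] args) {
        List<Object> mixed = Arrays.<Object>asList(1, 2.5, "3.5", 5L);
        System.out.println(MixedListMean.mean(mixed, new DefaultTransformer()));
    }
}
```

The point of the design is that supporting a new object type means writing one new transformer, rather than another family of delegation methods in StatUtils.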

Unfortunately, I can tell that the supporting work I've completed in this area will also be controversial with you, as it involves some restructuring of the Univariate class hierarchy with the addition of an AbstractUnivariate class, which I'm sure will receive objection as well. This saddens me because the work I am doing makes total sense to me conceptually. In an object-oriented language, the tools we work with are objects. The design is conceptually OO and not procedural; Java is not Fortran.

The checkin comment says that the new univariate framework is independent of the existing implementations; but StatUtils has been modified to include numerous static data members and to delegate computation to these.

Yes, I committed the modifications to StatUtils some time after my commit of the framework. These were different commits with different purposes. If the StatUtils commit is premature it can be rolled back, but I would rather hear from other parties concerning the architecture and the changes before taking any backward steps. I ran the JUnit tests to verify that the changes did not create any inaccuracies. My implementations of the statistics are based entirely on both the methods in Univariate and the previous StatUtils classes.

This adds significant overhead and I do not see the value in it. The cost of the additional stack operations/object creations is significant. I ran tests comparing the previous version that does direct computations using the double[] arrays to the modified version and found an average of more than 6x slowdown using the new implementation. I did not profile memory utilization, but that is also a concern. Repeated tests computing the mean of 1000 doubles 100000 times using the old and new implementations averaged 1.5 and 10.2 seconds, resp. I do not see the need for all of this additional overhead.

If you review the code, you'll find there is no added "object creation": the static Variable objects calculate on double[] just as the Univariates did. I would have to see more substantial analysis to believe your claim. All that's going on here is that the static StatUtils methods are delegating to individual static instances of UnivariateStatistics. These are instantiated once, when the class is initialized, like all static objects; calling a method on such an object should not require significantly more overhead than having the computation coded directly into the static method.
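The delegation pattern in question looks roughly like the following. This is a simplified sketch with made-up names (MeanStatistic, StatUtilsSketch), not the committed code, and the timing loop in main is only a crude comparison modeled on the test described above, not a rigorous benchmark.

```java
// Hypothetical stand-in for one statistic implementation.
final class MeanStatistic {
    double evaluate(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}

final class StatUtilsSketch {
    // Created once, when this class is initialized; nothing is allocated per call.
    private static final MeanStatistic MEAN = new MeanStatistic();

    private StatUtilsSketch() {}

    // Delegating version: one extra method call on a shared instance.
    static double mean(double[] values) {
        return MEAN.evaluate(values);
    }

    // Direct version: the same loop coded into the static method.
    static double meanDirect(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}

class DelegationDemo {
    public static void main(String[] args) {
        double[] data = new double[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        double sink = 0.0; // accumulate results so the loops are not optimized away

        long t0 = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            sink += StatUtilsSketch.meanDirect(data);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            sink += StatUtilsSketch.mean(data);
        }
        long t2 = System.nanoTime();

        System.out.println("direct ms:    " + (t1 - t0) / 1000000);
        System.out.println("delegated ms: " + (t2 - t1) / 1000000);
        System.out.println("checksum:     " + sink);
    }
}
```

Timings from a loop like this depend heavily on JIT warm-up and on the order in which the two variants run, so any numbers it prints should be treated as indicative only.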

If there are performance considerations, let's discuss these.

I doubt (as the numerous discussions over the past week have pointed out) that what we really want to have in StatUtils is one monolithic Static class with all the implemented methods present in it. If I have misinterpreted this opinion in the group, then I'm sure there will be responses to this.

I suggest that we postpone introduction of a statistical computation framework
until after the initial release, if needed.  In any case, I would like to keep
StatUtils and the core UnivariateImpl small, fast and lightweight, so I would
like to request that the changes to these classes be rolled back.

I would really like to see an architecture that's more than just one flat static class with a bunch of double[] methods in it. This is not very useful to me.

If others feel that this additional infrastructure is essential, then I just need to be educated. It is quite possible that I am thinking too narrowly in terms of current scope and I may be missing some looming structural problems. If this is the case, I am open to being educated. I just need to see a) exactly why we need to add more complexity at this time and b) why breaking univariate statistics into four packages and 17 classes when all we are computing is basic statistics is necessary.

The packages are categorical, the classes are implementations of each statistic. The framework provides an intuitive and organized means for others to easily implement and add statistics to the packages without being restricted to a fascist and monolithic Univariate interface or static StatUtils interface.

If anything, the continued conflict between our two schools of thought shows the necessity of such an approach. Your school of thought can retain the monolithic interfaces for "Univariate" and "StatUtils", while the framework can provide others with the ability to extend and expand the library without such "heavy handed" restrictions that cripple the extensibility of the project.

There was a great deal of discussion about the benefit of not having the methods implemented directly in static StatUtils because they could not be "overridden" or worked with in an Instantiable form. This approach frees the implementations up to be overridden and frees up room for alternate implementations.

You may have your opinions of how you would like to see the packages organized and implemented. Others in the group do have alternate opinions to yours. I for one see a strong value in individually implemented Statistics. I also have a strong vision that the framework I have been working on provides substantial benefits.

(1a.) It allows both the storageless and storage-based implementations to function behind the same interface. No matter if you're calling

increment(double d)

or

evaluate(double[]...)

you're working with the same algorithm.

(1b.) If you wish to have alternate implementations for evaluate and increment, it is easily possible to overload these methods in future versions of the implementations.

(2.) With individual implementations, alternate approaches can be coded and included for the benefit of those who have an interest in such implementations. Thus there could be multiple versions of Variance, based on the strategy of interest and the numerical accuracy required.

(3.) Having the same implementations of statistics usable across all Univariate implementations assures a standard behavior and the same expected results no matter if you're using incremental or evaluation-based approaches.

(4.) The framework provides a formal structure for the future growth of the library. Knowing what a UnivariateStatistic is, and seeing the various implementations, it's obvious which route one will take to implement future statistics of interest.
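Points (1a.) through (4.) can be sketched as follows. The interface shape and the RunningMean class are illustrative assumptions for this example only; the method names increment and evaluate come from the discussion above, while getResult and clear are made up here.

```java
// Hypothetical shape of a single-statistic interface.
interface UnivariateStatistic {
    void increment(double d);          // storageless: update with one value
    double getResult();                // current value of the statistic
    void clear();                      // reset internal state
    double evaluate(double[] values);  // storage based: compute over an array
}

// Illustrative mean using a running update, so no values are stored.
final class RunningMean implements UnivariateStatistic {
    private long n = 0;
    private double mean = 0.0;

    @Override
    public void increment(double d) {
        n++;
        mean += (d - mean) / n;
    }

    @Override
    public double getResult() {
        return n == 0 ? Double.NaN : mean;
    }

    @Override
    public void clear() {
        n = 0;
        mean = 0.0;
    }

    // Runs the same update rule over the array, leaving this instance's state untouched.
    @Override
    public double evaluate(double[] values) {
        RunningMean scratch = new RunningMean();
        for (double v : values) {
            scratch.increment(v);
        }
        return scratch.getResult();
    }
}

class UnivariateStatisticDemo {
    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0};

        UnivariateStatistic incremental = new RunningMean();
        for (double v : data) {
            incremental.increment(v);
        }

        // Both paths go through the same update rule.
        System.out.println(incremental.getResult());
        System.out.println(new RunningMean().evaluate(data));
    }
}
```

An alternate implementation (for example, a variance with a different numerical strategy) would simply be another class implementing the same interface, which is what points (2.) and (4.) rely on.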

Phil, it's clear we have very different "schools of thought" on the subject of how the library should be designed. As a developer on the project I have a right to promote my design model and interests. The architecture is something I have a strong interest in working with.

Apache projects are "group" projects. If a project such as [math] cannot find community and room for multiple directions of development, if it cannot make room for alternate ideas and visions, and if both revolutionary and evolutionary processes cannot coexist, I doubt the project will have much of a future at all.

-Mark

