Phil Steitz wrote:
Sorry, last reply got sent before I was done with it. Pls disregard and try
this....


This adds significant overhead and I do not see the value in it.  The cost of the
additional stack operations/object creations is significant.  I ran tests
comparing the previous version that does direct computations using the double[]
arrays to the modified version and found an average of more than 6x slowdown
using the new implementation. I did not profile memory utilization, but that is
also a concern. Repeated tests computing the mean of 1000 doubles 100000
times using the old and new implementations averaged 1.5 and 10.2 seconds,
resp. I do not see the need for all of this additional overhead.



If you review the code, you'll find there is no added "object creation"; the static Variable objects calculate on double[] just as the Univariates did. I would have to see more substantial analysis to believe your claim. All that's going on here is that the static StatUtils methods are delegating to individual static instances of UnivariateStatistics. These are instantiated at JVM startup like all static objects, and calling a method on such an object should not require any more overhead than having the method coded directly into the static method.


If there are performance considerations, let's discuss them.


Here is what I added to StatUtils.test

double[] x = new double[1000];
for (int i = 0; i < 1000; i++) {
    x[i] = (5 - i) * (i - 200);
}
long startTick = 0;
double res = 0;
for (int j = 0; j < 10; j++) {
    startTick = System.currentTimeMillis();
    for (int i = 0; i < 100000; i++) {
        res = OStatUtils.mean(x);
    }
    System.out.println("old: " + (System.currentTimeMillis() - startTick));
    startTick = System.currentTimeMillis();
    for (int i = 0; i < 100000; i++) {
        res = StatUtils.mean(x);
    }
    System.out.println("new: " + (System.currentTimeMillis() - startTick));
}


The result was a mean of 10203 for the "new" and 1531.1 for the "old", with
standard deviations 81.1 and 13.4 resp. The overhead is the stack operations
and temp object creations.



Ok, yes, you've got me on this one. I ran these tests and you're correct: the increment method approach (while great for storageless implementation) does incur added cost in calculation (specifically in the added divisions that are occurring). It would have been better for me to retain the implementations provided in the static StatUtils lib for the double[]-based methods.


*the direct evaluation approach here*

public double evaluate(double[] d, int start, int length) {
    double accum = 0.0;
    for (int i = start; i < start + length; i++) {
        accum += d[i];
    }
    return accum / (double) length;
}

*takes fewer cycles to calculate than the below incremental approach*

int n = 0;
double m1 = Double.NaN;

public double evaluate(double[] d, int start, int length) {
    for (int i = start; i < start + length; i++) {
        increment(d[i]);
    }
    return getValue();
}

public double increment(double d) {
    if (n < 1) {
        m1 = 0.0;
    }
    n++;
    m1 += (d - m1) / ((double) n);
    return m1;
}

I will add the direct approaches back into the implementations I have written to regain this efficiency. I'll also roll back StatUtils in the meantime so that it is not dependent on these limitations.
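To show what I mean (a sketch only; the class name Mean and method shapes here are hypothetical, not committed code): keep the storageless increment() path, but give the statistic its own direct evaluate() so the double[] path pays one division at the end instead of one per element.

```java
// Hypothetical sketch of a mean statistic with both entry points.
class Mean {
    private long n = 0;
    private double m1 = Double.NaN;

    // Storageless path: running mean, one division per value.
    public double increment(double d) {
        if (n < 1) {
            m1 = 0.0;
        }
        n++;
        m1 += (d - m1) / (double) n;
        return m1;
    }

    // Current value accumulated via increment().
    public double getValue() {
        return m1;
    }

    // Direct path: single pass over the range, one division at the end.
    public double evaluate(double[] d, int start, int length) {
        double accum = 0.0;
        for (int i = start; i < start + length; i++) {
            accum += d[i];
        }
        return accum / (double) length;
    }

    public double evaluate(double[] d) {
        return evaluate(d, 0, d.length);
    }
}
```

A static StatUtils.mean could then delegate to a single static Mean instance's evaluate(), so the static facade keeps the direct computation's speed.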


I doubt (as the numerous discussions over the past week have pointed out) that what we really want in StatUtils is one monolithic static class with all the implemented methods present in it. If I have misinterpreted the opinion of the group, then I'm sure there will be responses to this.


Well, I for one would prefer to have the simple computational methods in one
place.  I would support making the class require instantiation, however, i.e.
making the methods non-static.


Yes, but again it is a question of having big, flat, monolithic classes vs. having extensible implementations that can easily be expanded on. I'm not particularly thrilled at the idea of being totally locked into an interface like Univariate or StatUtils. It is just totally inflexible, and there is always too much restriction and argument about what we want to put in it vs. leave out of it.




There was a great deal of discussion about the benefit of not having the methods implemented directly in the static StatUtils, because they could not be "overridden" or worked with in an instantiable form. This approach frees the implementations up to be overridden and makes room for alternate implementations.


As I said above, the simplest way to deal with this is to make the methods
non-static.

Yes, simple, but not very organized, and not as extensible as a framework like "solvers" is. You can implement any new "solver" we could desire right now without much complaint, but try to implement a new statistic and blam, all this argument starts up as to whether it's appropriate or not for the Univariate interface. There's no room for growth here! If I decide to go down the road and try to implement things like auto-correlation coefficients (which would be a logical addition someday), then I end up having to get permission just to "add" the implementation, whereas if there's a logical framework, there's more room for growth without stepping on each other's toes so much. This is very logical to me.


You may have your opinions of how you would like to see the packages organized and implemented; others in the group have alternate opinions. I for one see strong value in individually implemented statistics. I also have a strong vision that the framework I have been working on provides substantial benefits.

(1a.) It allows both the storageless and storage-based implementations to function behind the same interface. No matter whether you're calling increment(double d) or evaluate(double[]...), you're working with the same algorithm.
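As a sketch of what (1a.) looks like in code (the class names AbstractUnivariateStatistic and IncrementalMean are hypothetical, not the current sources): one abstract statistic carries both entry points, with evaluate() expressed in terms of increment(), so either calling style reaches the same algorithm.

```java
// Hypothetical sketch: one statistic serving both calling styles.
abstract class AbstractUnivariateStatistic {

    // Storageless entry point: fold one value into internal state.
    public abstract double increment(double d);

    // Current value of the statistic after increments.
    public abstract double getValue();

    // Storage-based entry point, defined in terms of increment(),
    // so both calling styles run the same algorithm.
    public double evaluate(double[] d, int start, int length) {
        for (int i = start; i < start + length; i++) {
            increment(d[i]);
        }
        return getValue();
    }
}

// A concrete statistic then only needs the incremental form.
class IncrementalMean extends AbstractUnivariateStatistic {
    private long n = 0;
    private double m1 = Double.NaN;

    public double increment(double d) {
        if (n < 1) {
            m1 = 0.0;
        }
        n++;
        m1 += (d - m1) / (double) n;
        return m1;
    }

    public double getValue() {
        return m1;
    }
}
```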


That is true in the old implementation as well, with the core computational
methods in StatUtils.

No, in the original implementation the "incremental" approaches are different implementations than the "evaluation" double[] approaches, as we've seen in the case above. The trade-off is accuracy vs. efficiency. In the old implementation's case the incrementals are in UnivariateImpl while the evaluation strategies are in StatUtils (and currently duplicated in StoreUnivariateImpl).



(1b.) If you wish to have alternate implementations for evaluate and increment, it is easily possible to overload these methods in future versions of the implementations.
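For instance (the class name TwoPassMean is hypothetical, just illustrating the kind of alternate implementation (1b.) allows): an overriding evaluate() could use a two-pass mean, trading a second pass over the data for better numerical accuracy, by summing the floating-point residuals around the first-pass mean and applying them as a correction.

```java
// Hypothetical sketch: an alternate, two-pass evaluate() for the mean.
class TwoPassMean {
    public double evaluate(double[] d, int start, int length) {
        // First pass: ordinary single-pass mean.
        double sum = 0.0;
        for (int i = start; i < start + length; i++) {
            sum += d[i];
        }
        double mean = sum / (double) length;

        // Second pass: accumulate residuals around the mean.  In exact
        // arithmetic this sum is zero; in floating point it captures
        // rounding error, which we fold back in as a correction.
        double correction = 0.0;
        for (int i = start; i < start + length; i++) {
            correction += d[i] - mean;
        }
        return mean + correction / (double) length;
    }
}
```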


Just make the methods non-static and that will be possible.  I am not sure, given the relative triviality of these methods, if this is really a big deal, however.

Then we're back to dependency issues where now there's "another" interface that is restrictive and difficult to expand upon easily. It will be hard to add things to the library because everyone will be arguing about what should or shouldn't be in the interface, uughh. :-(


I am becoming more and more against having these "generic" Univariate interfaces where a particular statistic is embodied in a "method". Every time someone comes up with a "new method" there will be debate about whether it should be in the interface or not, instead of it just being an additional class that can be added to the package. This is the benefit of a framework over monolithic interfaces.



Phil, it's clear we have very different "schools of thought" on the subject of how the library should be designed. As a developer on the project I have a right to promote my design model and interests. The architecture is something I have a strong interest in working on.


You certainly have the right to your opinions.  Others also have the right to
disagree with them.

Apache projects are "group" projects. If a project such as [math] cannot find community and room for multiple directions of development, if it cannot make room for alternate ideas and visions, if both revolutionary and evolutionary processes cannot coexist, I doubt the project will have much of a future at all.


I agree with this as well; but from what I have observed, open source projects
do best when they do not try to go off in divergent directions at the same
time. If we cannot agree on a consistent architecture direction, then I don't
think we will succeed. If we can and we stay focused, then we will.

I don't think trying to come up with the best design for the library equates very well to "being unfocused".


As I said above, if others agree with the approach that you want to take, then that is the direction that the project will go.  I am interested in the opinions of Tim, Robert and the rest of the team.

Phil


I am interested as well in what they have to say.


-Mark

