Phil Steitz wrote:
Ken Geis wrote:

I'm playing with commons-math to implement a data mining algorithm and I am having a performance problem.

I am computing running statistics over an ordered set of data, storing the statistics at each new value I come across. One way to do this is to keep an array of SummaryStatistics and do:

for (int i = 0; i < length; i++)
{
    for (int j = i; j < length; j++)
    {
        statsArray[j].addValue(values[i]);
    }
}

Another way is to do:

for (int i = 0; i < length; i++)
{
    stats.addValue(values[i]);
    statsArray[i] = SerializationUtils.clone(stats);
}

A lot of these objects are marked Serializable, but they have no clone() methods. That's why I use commons-lang SerializationUtils. Unfortunately, the cloning then takes up 50% of my runtime because (de)serialization is expensive.

I will probably patch the statistics classes, implementing enough of clone() to make me happy. Would you like this patch?
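As a rough illustration of why a field-by-field copy is cheaper than a serialization round trip, here is a minimal sketch. CopyableStats, its copy constructor, and prefixStats are hypothetical names invented for this example; they are not part of commons-math, and the mean and variance use Welford's update rather than whatever the library does internally.

```java
// Hypothetical stand-in for a running-statistics accumulator.
// It is NOT the commons-math SummaryStatistics class.
final class CopyableStats {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    CopyableStats() {
    }

    // Copy constructor: duplicates three fields, with no (de)serialization.
    CopyableStats(CopyableStats other) {
        this.n = other.n;
        this.mean = other.mean;
        this.m2 = other.m2;
    }

    // Welford's update; keeps mean and m2 current in O(1) per value.
    void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    // Sample variance (n - 1 denominator); 0 for a single value, NaN when empty.
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        return n == 1 ? 0.0 : m2 / (n - 1);
    }

    // Stores an independent copy of the running state after each value.
    static CopyableStats[] prefixStats(double[] values) {
        CopyableStats stats = new CopyableStats();
        CopyableStats[] statsArray = new CopyableStats[values.length];
        for (int i = 0; i < values.length; i++) {
            stats.addValue(values[i]);
            statsArray[i] = new CopyableStats(stats);
        }
        return statsArray;
    }
}
```

Each stored element is a separate object, so later calls to addValue on the running instance do not change the copies already saved.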


An efficient cloning method might be useful, but it would still carry around extra baggage and overhead for your use case (SummaryStatisticsImpl nests a bunch of little stats objects and other instance data).

What we might want to do is add a StatisticalSummaryBean implementing the StatisticalSummary interface, and add a getSummary method to SummaryStatistics that returns an instance of this "value bean" containing only the values of the statistics. Then you could just do

for (int i = 0; i < length; i++)
{
    stats.addValue(values[i]);
    statsArray[i] = stats.getSummary();
}

since I presume that all you will want to do with statsArray[i] is things like getMean(), getVariance(), etc. This would require much less overhead than cloning the whole SummaryStatisticsImpl instance each time.
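To make the "value bean" idea concrete, here is a minimal sketch under stated assumptions. SummarySnapshot and RunningStats are hypothetical names made up for this example, and the set of statistics carried (n, mean, variance, min, max, sum) is only a guess at what such a bean might hold; none of this is the actual commons-math API.

```java
// Hypothetical immutable value holder: only computed numbers, no accumulator state.
final class SummarySnapshot {
    private final long n;
    private final double mean, variance, min, max, sum;

    SummarySnapshot(long n, double mean, double variance,
                    double min, double max, double sum) {
        this.n = n;
        this.mean = mean;
        this.variance = variance;
        this.min = min;
        this.max = max;
        this.sum = sum;
    }

    long getN() { return n; }
    double getMean() { return mean; }
    double getVariance() { return variance; }
    double getMin() { return min; }
    double getMax() { return max; }
    double getSum() { return sum; }
}

// Hypothetical accumulator; a stand-in for SummaryStatistics, not the real class.
final class RunningStats {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the mean
    private double sum;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Updates all running values in O(1) using Welford's method for mean and m2.
    void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        sum += x;
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    // Freezes the current values into a read-only snapshot.
    SummarySnapshot getSummary() {
        if (n == 0) {
            double nan = Double.NaN;
            return new SummarySnapshot(0, nan, nan, nan, nan, 0.0);
        }
        double variance = n == 1 ? 0.0 : m2 / (n - 1); // sample variance
        return new SummarySnapshot(n, mean, variance, min, max, sum);
    }

    // Mirrors the loop above: one snapshot per prefix of the data.
    static SummarySnapshot[] prefixSummaries(double[] values) {
        RunningStats stats = new RunningStats();
        SummarySnapshot[] statsArray = new SummarySnapshot[values.length];
        for (int i = 0; i < values.length; i++) {
            stats.addValue(values[i]);
            statsArray[i] = stats.getSummary();
        }
        return statsArray;
    }
}
```

Because the snapshot has only final fields and no setters, it cannot be incremented afterwards, which fits the case where statsArray[i] is only ever read.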

Since this would amount to a change to the SummaryStatistics interface, if we want to do it, we should do it now, before 1.0. I am +1 to this change and willing to implement it if no one objects.

Phil

Phil,


What would a Summary object contain? I suspect it would contain a double value for each statistic (mean, variance, ...). I assume a summary is read-only? I get the feeling this is something that would contain various methods to return the statistical values. The inevitable debate then arises: which statistics should go in it? Different users have different needs. If all we are talking about is an object that contains the current state of the SummaryStatisticsImpl, I'd argue this should be an exercise left to the user.

Ken,

I guess I am a little confused about the purpose of all this cloning. Are you trying to keep a snapshot of the statistics up to each point in the array? Are the copies in statsArray ever incremented, or are they just replaced by new copies?

I'm still interested in well-established specs for cloning and serialization on these objects. We need to be able to persist and duplicate them without introducing errors.

-Mark
--
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu
