Re: [sc-dev] Statistical Functions Implementation

Eike Rathke Wed, 15 Nov 2006 08:52:21 -0800

Hi Leonard,

On Wednesday, 2006-11-15 03:21:48 +0200, Leonard Mada wrote:


> 1. I have corrected some issues/bugs in the initial ANOVA code

Please note that before we can integrate code or data contributed we
need a signed Joint Copyright Assignment form (JCA) filled-out, see
http://contributing.openoffice.org/programming.html#jca

> //     CALCULATING ANOVA
> // (C) LEONARD MADA (2006)
> //
> //  THIS IS FREE SOFTWARE
> //
> // THIS SOFTWARE IS RELEASED UNDER BOTH
> // THE GNU GPL AS WELL AS THE LGPL

Note that when contributing code under the JCA such headers are without
effect and would be removed. Furthermore, this isn't Fortran ;-) and it
would be much more eye-friendly if you fixed your CapsLock key and
refrained from using only capitalized letters in comments. Please use
normal capitalization instead. Thanks.

> void ScInterpreter::ScANOVA()
> {
>       // WE GET EITHER A SINGLE MATRIX WHERE EVERY COLUMN IS A SEPARATE 
> VARIABLE
>       //    DISADVANTAGE: ONLY ONE COLUMN PER VARIABLE
>       // OR MULTIPLE MATRICES, EACH MATRIX IS ONE VARIABLE
>       //    DISADVANTAGE:
>       //      CALC FUNCTIONS ACCEPT ONLY 30 PARAMS
>       //      SO THERE ARE AT MOST 30 VARIABLES

Not quite true. The UI in the Formula AutoPilot knows only 30 parameters
at maximum, the compiler and interpreter actually can handle more. The
theoretical maximum currently is 256 due to the BYTE variable used as
parameter count. But designing the parameters is more a question of how
other spreadsheet applications do it. We should follow that.

For passing area references or arrays/matrices you may also want to take
a look at other functions that handle multiple ranges like SUM() or
AVERAGE(), implemented in interpr1.cxx method
ScInterpreter::IterateParameters(). The number of elements in a matrix
is limited, and obtaining a matrix from a parameter that is passed as an
area reference is additional unnecessary overhead.

>       SCSIZE iVarNr   = GetByte() /* NUMBER OF PARAMETERS */;
> [...]
>       ScMatrixRef pMat[iVarNr]; // INITIALISE DATA-MATRICES
>       SCSIZE nR[iVarNr], nC[iVarNr]; // INITIALISE MATRIX DIMENSIONS

This actually doesn't work. Automatically allocating variable-length
arrays on the stack is a GCC extension and doesn't work with all
compilers. Instead it should be

    ScMatrixRef pMat = new ScMatrixRef[iVarNr];

>       SCSIZE nMAX = 0; // MAX DIMENSION OF ANY OF THE MATRICES
>       for(SCSIZE i= 0; i<iVarNr; i++) {
>               ScMatrixRef pMat[i] = GetMatrix();
>               if (!pMat[i]) {
>                       // NO DATA MATRIX - INVALID PARAMETERS
>                       SetIllegalParameter();

Here we need to delete the new'ed pMat array before returning from the
method. Or better, use std::auto_ptr from the beginnning, that does it
automatically.

>                       return;
> [...]
>       if( nMAX == 1 ) {
>               SetNoValue();

Same here, delete new'ed memory.

>       double fSumX[iVarNr] ; // THIS STORES THE MEANS FOR THE INDIVIDUAL 
> VARIABLES
>       double fValX[iVarNr] [nMAX]; // THE VALUES

Also this doesn't work, the arrays need to be dynamically allocated.
I also don't see why we need  fValX[iVarNr][nMAX]  elements, where maybe
most of it will be unused if only one matrix has nMAX elements. I think
this can be much improved by storing just the needed elements.
Furthermore, at least in this matrx scenario, to me it looks like
copying the single matrix elements to fValX[][] is not necessary at all
since the matrix has it all. However, if changed to area reference
iteration this may be necessary to avoid iterating twice. But then
taking care of the memory allocation footprint would be even more
important.

>               fSumX[iCount] = fSumX[iCount] / jCount; // THIS IS THE MEAN
>               nR[iCount] = jCount; // STORE HOW MANY DATA VALUES EXIST FOR 
> THIS VAR
>               N += jCount; // ADD FOR GRAND TOTAL = TOTAL No OF DATA VALUES
>               jCount = 0;  // RESET jCount FOR NEXT VARIABLE

Btw, code is much more readable if you align the trailing comments (and
use proper capitalization, of course ;-)

        fSumX[iCount] = fSumX[iCount] / jCount; // This is the mean.
        nR[iCount] = jCount;                    // Store how many data values 
exist for this var.
        N += jCount;                            // Add for grand total = total 
no of data values.
        jCount = 0;                             // Reset jCount for next 
variable.


> void ScInterpreter::ScANOVAMono()
> {
>       // WE GOT A SINGLE MATRIX WHERE EVERY COLUMN IS A SEPARATE VARIABLE
>       //    DISADVANTAGE: ONLY ONE COLUMN PER VARIABLE
>       //    BUT IT IS EASYER TO USE AND IT IS NOT LIMITED TO 30 VARIABLES

I probably would avoid the duplicated implementation and make this
a special case of the general ANOVA instead. However, also here it is
more a matter of how other spreadsheet programs handle the parameters.

  Eike

-- 
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to the [EMAIL PROTECTED] account, which I use 
for
 mailing lists only and don't read from outside Sun. Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: [sc-dev] Statistical Functions Implementation

Reply via email to