Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Leonard Mada

Hi to everyone.

I wish to discuss the following points:
1.) NEED HELP to compile
2.) ANOVA

2.) ANOVA
==========

I will start with point 2. Should I open a new thread for this, by the way?

I have appended a draft implementation of the *ANOVA function*; please 
see the code. There are still some unfinished issues, BUT I need help 
with those.


- the ANOVA code proper is finished
- depending on how the variable arrays are implemented, it might become 
necessary to keep track of the array lengths

  -- IF implementing as vectors, NO such need
- some issues remain at initializing the function, like retrieving how 
many parameters were passed, and the like

- I will discuss in a later post my ideas on how to implement the output

ANOVA is probably one of the last functions that could be implemented 
directly in Calc (the Fisher exact test would be another one). All other 
advanced statistical functions would best be implemented through 
external software (e.g. R).


1.) COMPILING
==========

Niklas Nebel wrote:
OOo can now be built on Windows using only free tools. Noel just 
posted something at 
http://noelpower.blogs.ie/2006/11/10/hurray-now-you-can-build-with-free-compiler-on-windows/. 
I haven't tried it myself, though.


I have NO idea how this goes. I need help. As I previously mentioned, I 
am NOT an IT professional. It just happens that I know some C++.


If anybody can compile this, I would be very thankful. It really 
corrects a bug inside Calc.



Kind regards,

Leonard Mada
void ScInterpreter::ScANOVA()
{
    // WE GET EITHER A SINGLE MATRIX WHERE EVERY COLUMN IS A SEPARATE VARIABLE
    //   DISADVANTAGE: ONLY ONE COLUMN PER VARIABLE
    // OR MULTIPLE MATRICES, EACH MATRIX IS ONE VARIABLE
    //   DISADVANTAGE: CALC FUNCTIONS ACCEPT ONLY 30 PARAMS,
    //   SO THERE ARE AT MOST 30 VARIABLES

    SCSIZE iVarNr = /* NUMBER OF PARAMETERS */ 0; // NUMBER OF VARIABLES
    if (iVarNr == 0) // NO PARAMETERS
        return; // EXIT

    std::vector<ScMatrixRef> pMat(iVarNr);
    SCSIZE nC, nR;
    if (iVarNr == 1) // ONLY ONE PARAMETER
    {
        pMat[0] = GetMatrix();
        if (!pMat[0])
        {
            // NO DATA MATRIX - INVALID PARAMETERS
            SetIllegalParameter();
            return;
        }
        // WE HAVE ONLY ONE MATRIX:
        // WE CONSIDER EVERY COLUMN AS A SEPARATE DATA SET
        pMat[0]->GetDimensions(nC, nR);
        iVarNr = nC;
        // TODO: SPLIT THE SINGLE MATRIX INTO ONE COLUMN PER VARIABLE
    }
    else
        for (SCSIZE i = 0; i < iVarNr; i++)
            pMat[i] = GetMatrix();

    std::vector< std::vector<double> > fValX(iVarNr); // PER-VARIABLE VALUES
    std::vector<double> fSumX(iVarNr);  // PER-VARIABLE SUM, LATER THE MEAN
    double fSumM = 0.0; // GRAND SUM, LATER THE GRAND MEAN
    SCSIZE N = 0;       // TOTAL NUMBER OF VALUES
    SCSIZE jCount = 0;

    for (SCSIZE iCount = 0; iCount < iVarNr; iCount++)
    {
        pMat[iCount]->GetDimensions(nC, nR);
        fSumX[iCount] = 0.0; // INITIALIZE THE SUM
        for (SCSIZE i = 0; i < nC; i++)
            for (SCSIZE j = 0; j < nR; j++)
            {
                if (!pMat[iCount]->IsString(i,j))
                {
                    fValX[iCount].push_back( pMat[iCount]->GetDouble(i,j) );
                    fSumX[iCount] += fValX[iCount][jCount];
                    jCount++;
                }
            }
        fSumM += fSumX[iCount];
        fSumX[iCount] = fSumX[iCount] / jCount; // THIS IS THE MEAN
        N += jCount;
        jCount = 0; // RESET jCount FOR NEXT VARIABLE
    } // END OUTER FOR LOOP

    if (iVarNr < 2)
        SetNoValue();
    else
    {
        double dfB = iVarNr - 1.0;
        double dfE = double(N) - double(iVarNr);
        double fMSB = 0.0;
        double fMSE = 0.0;
        fSumM = fSumM / N; // THIS IS THE GRAND MEAN

        for (SCSIZE i = 0; i < iVarNr; i++)
        {
            // GROUPS MAY HAVE DIFFERENT SIZES, SO USE EACH GROUP'S OWN SIZE
            for (SCSIZE j = 0; j < fValX[i].size(); j++)
            {
                fMSE += (fValX[i][j] - fSumX[i]) * (fValX[i][j] - fSumX[i]);
                // fMSB += (fSumM - fSumX[i]) * (fSumM - fSumX[i]);
                // TO AVOID MORE COMPUTATIONS WE CALCULATE fMSB OUTSIDE THIS LOOP
            }
            fMSB += fValX[i].size() * (fSumM - fSumX[i]) * (fSumM - fSumX[i]);
        }
        fMSB = fMSB / dfB;
        fMSE = fMSE / dfE;
        PushDouble( fMSB/fMSE );
        // WE STILL NEED TO INTERPRET fMSB/fMSE USING THE F STATISTICS
    }
}
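As a sanity check on the formulas in the draft, the same computation can be sketched outside the interpreter on plain std::vector groups. This is a minimal illustration, not Calc code: anovaF is a hypothetical name, and the group data is passed in directly instead of being read from matrices.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-way ANOVA F statistic for k groups of (possibly different) sizes.
// Mirrors the draft above: per-group means, grand mean, then the
// between-group (MSB) and within-group (MSE) mean squares.
double anovaF(const std::vector< std::vector<double> >& groups)
{
    std::size_t k = groups.size();
    std::size_t N = 0;
    double grand = 0.0;
    std::vector<double> mean(k, 0.0);
    for (std::size_t g = 0; g < k; ++g) {
        for (std::size_t j = 0; j < groups[g].size(); ++j) {
            mean[g] += groups[g][j];
            grand   += groups[g][j];
        }
        N += groups[g].size();
        mean[g] /= groups[g].size();
    }
    grand /= N;                       // the grand mean over all values

    double ssb = 0.0, sse = 0.0;
    for (std::size_t g = 0; g < k; ++g) {
        ssb += groups[g].size() * (mean[g] - grand) * (mean[g] - grand);
        for (std::size_t j = 0; j < groups[g].size(); ++j)
            sse += (groups[g][j] - mean[g]) * (groups[g][j] - mean[g]);
    }
    double msb = ssb / (k - 1);       // dfB = k - 1
    double mse = sse / (N - k);       // dfE = N - k
    return msb / mse;
}
```

For example, the three groups {1,2,3}, {2,3,4}, {4,5,6} have means 2, 3, 5 and grand mean 10/3, giving MSB = 7 and MSE = 1, i.e. F = 7. As the draft notes, the F value still has to be interpreted against the F distribution with (dfB, dfE) degrees of freedom.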

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Niklas Nebel

Leonard Mada wrote:

I wish to mention the following 3 issues:
1. compiling and testing the new algorithm
2. "void ScInterpreter::ScCorrel()" AND "void 
ScInterpreter::ScPearson()" seem to be identical
  -- I have found NO difference in the documentation for CORREL() and 
PEARSON(), but I may have overlooked something
  -- IF they are truly identical, then calling ScCorrel() from within 
::ScPearson() would be OK

  -- NO need to maintain the same code twice


No, you're right. That should be unified. Well spotted.

3. other functions seem/are broken, like "void ScInterpreter::ScCovar()" 
AND "void ScInterpreter::ScRSQ()",...

  -- would need to be fixed, too


Sure. Most of them are still the initial implementation. Only a few 
functions have ever been updated.


Ok, then everything is fine. Can somebody compile this? I do not really 
know how to compile it (nor do I have the tools). Before integrating 
this into an official OOo, I believe it should be extensively tested.


OOo can now be built on Windows using only free tools. Noel just posted 
something at 
http://noelpower.blogs.ie/2006/11/10/hurray-now-you-can-build-with-free-compiler-on-windows/. 
I haven't tried it myself, though.


Is it possible to compile only a dll, so that I only have to replace one 
file in my official OOo? (I have NO idea whether this file gets compiled 
as a separate dll, but that would be really helpful.)


Built-in functions are in the main Calc dll, sc680mi.dll on Windows.

Niklas




Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Leonard Mada

Hi Eike,

I wish to mention the following 3 issues:
1. compiling and testing the new algorithm
2. "void ScInterpreter::ScCorrel()" AND "void 
ScInterpreter::ScPearson()" seem to be identical
  -- I have found NO difference in the documentation for CORREL() and 
PEARSON(), but I may have overlooked something
  -- IF they are truly identical, then calling ScCorrel() from within 
::ScPearson() would be OK

  -- NO need to maintain the same code twice
3. other functions seem/are broken, like "void ScInterpreter::ScCovar()" 
AND "void ScInterpreter::ScRSQ()",...

  -- would need to be fixed, too

I wish to discuss only shortly issue 1:

Eike Rathke wrote:

...
Noticed.

  
3. the x and y values (fValX[count] and fValY[count]) must be stored, so 
we have to define (variable) arrays. I do not know which method is best 
suited / will affect speed least. I usually prefer vectors in such a 
situation, but that might be too much for this one, especially if we do 
NOT do any sorting. Unfortunately, the size is not known beforehand.



A maximum size is known: there can't be more than nC1*nR1 elements, so
pre-allocating new double[nC1*nR1] is fine.
  


Ok, then everything is fine. Can somebody compile this? I do not really 
know how to compile it (nor do I have the tools). Before integrating 
this into an official OOo, I believe it should be extensively tested.


Is it possible to compile only a dll, so that I only have to replace one 
file in my official OOo? (I have NO idea whether this file gets compiled 
as a separate dll, but that would be really helpful.)


Although this algorithm is more accurate than the original naive 
algorithm, it requires 2 passes, so it is undoubtedly slower. Wikipedia 
describes a one-pass robust algorithm. While that would execute slightly 
faster, it is (probably) less accurate. Also, even this algorithm may 
fail in special situations (hence the need for the test condition in the 
code). Unlike for the STD DEV, I have no solid mathematical grounding 
for how robust each implementation is.
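For reference, the one-pass updating scheme mentioned above can be sketched as follows. This is only an illustration of the general Welford-style technique, not the exact Wikipedia listing; pearsonOnePass is a hypothetical name, and the guard for fewer than two values is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-pass updating scheme: running means plus running centered sums,
// so no large sums of raw squares are ever formed. Each step uses the
// OLD mean for the delta and the UPDATED mean inside the accumulation.
double pearsonOnePass(const std::vector<double>& x, const std::vector<double>& y)
{
    if (x.size() < 2)
        return NAN;                       // not enough data
    double meanX = 0.0, meanY = 0.0;
    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        double dx = x[n] - meanX;         // deltas against the OLD means
        double dy = y[n] - meanY;
        meanX += dx / (n + 1);            // update the running means
        meanY += dy / (n + 1);
        sxy += dx * (y[n] - meanY);       // old dx times NEW y-residual
        sxx += dx * (x[n] - meanX);
        syy += dy * (y[n] - meanY);
    }
    return sxy / std::sqrt(sxx * syy);
}
```

The centered updates keep the accumulated sums small even when the data carries a large common offset, which is exactly the case where the raw sum-of-squares formula loses precision.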


Kind regards,

Leonard Mada




Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Eike Rathke
Hi Leonard,

On Friday, 2006-11-10 18:32:23 +0200, Leonard Mada wrote:

> 1. I forgot a  "count++;"  in the first for LOOP
> [inside the:  if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j))  {}  body]
> 
> 2. because this count would be from 1 to n, the 2nd for LOOP should be 
> modified accordingly:
> for(j = 0; j < count; j++) { // NOT j <= count

Noticed.

> 3. the x and y values (fValX[count] and fValY[count]) must be stored, so 
> we have to define (variable) arrays. I do not know which method is best 
> suited / will affect speed least. I usually prefer vectors in such a 
> situation, but that might be too much for this one, especially if we do 
> NOT do any sorting. Unfortunately, the size is not known beforehand.

A maximum size is known: there can't be more than nC1*nR1 elements, so
pre-allocating new double[nC1*nR1] is fine.
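That pre-allocation strategy can be sketched in isolation. The names here are illustrative: collectNumeric stands in for the cell-scanning loop, isNumeric for the !pMat->IsString(i,j) test, and the stored value for GetDouble(i,j).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allocate once with the known upper bound nC1*nR1, then compact:
// only numeric cells are kept, and 'count' tracks the used prefix.
std::size_t collectNumeric(std::size_t nC1, std::size_t nR1,
                           bool (*isNumeric)(std::size_t),
                           std::vector<double>& out)
{
    out.resize(nC1 * nR1);      // single allocation at the maximum size
    std::size_t count = 0;
    for (std::size_t k = 0; k < nC1 * nR1; ++k)
        if (isNumeric(k))
            out[count++] = static_cast<double>(k); // placeholder for GetDouble
    out.resize(count);          // shrink to the part actually filled
    return count;
}
```

Whether this uses new double[nC1*nR1] or a std::vector is a detail; the point is that one allocation with the known maximum avoids any reallocation while scanning.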

  Eike

-- 
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to the [EMAIL PROTECTED] account, which I
 use for mailing lists only and don't read from outside Sun. Thanks.




Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Eike Rathke
Hi Leonard,

On Friday, 2006-11-10 17:13:00 +0200, Leonard Mada wrote:

> unsigned int count = 0; // Counter for values
> // DO WE NEED AN ??? fCount ???

Nah.. there won't be more than 2^32 matrix elements..

> OR is (unsigned int) count OK

size_t (respectively SCSIZE) is preferred, for clearer semantics.

  Eike





Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Leonard Mada

A small correction to my previous algorithm:

1. I forgot a  "count++;"  in the first for LOOP
[inside the:  if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j))  {}  body]

2. because this count would be from 1 to n, the 2nd for LOOP should be 
modified accordingly:

for(j = 0; j < count; j++) { // NOT j <= count

3. the x and y values (fValX[count] and fValY[count]) must be stored, so 
we have to define (variable) arrays. I do not know which method is best 
suited / will affect speed least. I usually prefer vectors in such a 
situation, but that might be too much for this one, especially if we do 
NOT do any sorting. Unfortunately, the size is not known beforehand.


Kind regards,

Leonard Mada




Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Leonard Mada

Niklas Nebel wrote:

sc/source/core/tool/interpr3.cxx, method ScInterpreter::ScCorrel.


Calc does indeed use the *naive algorithm* to compute the correlation 
coefficient (see Wikipedia).



if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j))
{
    fValX = pMat1->GetDouble(i,j);
    fValY = pMat2->GetDouble(i,j);
    fSumX += fValX;
    fSumSqrX += fValX * fValX;
    fSumY += fValY;
    fSumSqrY += fValY * fValY;
    fSumXY += fValX*fValY;
    fCount++;
}


The rewritten algorithm would look like:

unsigned int count = 0; // Counter for values
// DO WE NEED AN ??? fCount ??? OR is (unsigned int) count OK
double fMeanX = 0.0;
double fMeanY = 0.0;

for (SCSIZE j = 0; j < nR1; j++) {
    if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j))
    {
        fValX[count] = pMat1->GetDouble(i,j);
        fValY[count] = pMat2->GetDouble(i,j);
        fMeanX += fValX[count];
        fMeanY += fValY[count];
        // ALTERNATIVELY SORT FIRST X AND Y AND
        // CALCULATE THE MEANS IN A SEPARATE LOOP
        // FOR GREATER ACCURACY
    }
}

if (count < 2)
    SetNoValue();
else {
    fMeanX = fMeanX / count;
    fMeanY = fMeanY / count;

    double fSum = 0.0, fSDX = 0.0, fSDY = 0.0;
    for (j = 0; j <= count; j++) {
        fSum += (fValX[j] - fMeanX) * (fValY[j] - fMeanY);
        fSDX += (fValX[j] - fMeanX) * (fValX[j] - fMeanX);
        fSDY += (fValY[j] - fMeanY) * (fValY[j] - fMeanY);
    }

    double fVal = fSum / sqrt(fSDX * fSDY);
    if ( (fVal >= -1.0) && (fVal <= 1.0) ) {
        PushDouble( fVal );
    }
    else {
        // REPORT AN ERROR: INVALID VALUE
        // ALGORITHM FAILED
    }
}
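Folding in the corrections posted elsewhere in this thread (the missing count++ and the j < count loop bound), a self-contained two-pass version might look like the sketch below. pearsonTwoPass is a hypothetical name, the data arrives as vectors instead of matrices, and returning NAN stands in for SetNoValue().

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-pass Pearson correlation: the first pass accumulates the means,
// the second pass accumulates centered sums. This stays stable when
// the values share a large common offset, unlike the raw
// sum-of-squares formula.
double pearsonTwoPass(const std::vector<double>& x, const std::vector<double>& y)
{
    std::size_t count = x.size();
    if (count < 2)
        return NAN;                         // stands in for SetNoValue()

    double fMeanX = 0.0, fMeanY = 0.0;
    for (std::size_t j = 0; j < count; j++) {  // j < count, not j <= count
        fMeanX += x[j];
        fMeanY += y[j];
    }
    fMeanX /= count;
    fMeanY /= count;

    double fSum = 0.0, fSDX = 0.0, fSDY = 0.0;
    for (std::size_t j = 0; j < count; j++) {
        fSum += (x[j] - fMeanX) * (y[j] - fMeanY);
        fSDX += (x[j] - fMeanX) * (x[j] - fMeanX);
        fSDY += (y[j] - fMeanY) * (y[j] - fMeanY);
    }
    return fSum / std::sqrt(fSDX * fSDY);
}
```

On data like 1000000001, 1000000002, 1000000003 against the same values plus 10, the centered differences are exact and the result is 1, which is the kind of test case where the current one-pass formula breaks down.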




Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Niklas Nebel

Leonard Mada wrote:

Mind if I ask, in which file is the CORREL function defined?
Tests are still useful to see where the algorithm fails (and, 
presumably, the new algorithm should be correct). Extensive test cases 
are always welcome.


sc/source/core/tool/interpr3.cxx, method ScInterpreter::ScCorrel.

I also suggest that one of the developers sit down for half an hour and 
write a TEXT file containing:

- the statistical function name
- the source file where it is defined

It would be of great help for others who do not know the OOo source code 
well and would not know where to look.


Find the function name in sc/source/core/src/compiler.src, the 
corresponding OpCode value in sc/inc/opcode.hxx, and look in 
ScInterpreter::Interpret (sc/source/core/tool/interpr4.cxx) which method 
is called for the OpCode. Add-In functions are separate, but that's how 
you can find any built-in function's implementation.


Niklas




Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Leonard Mada

Niklas Nebel wrote:
They don't "accept" a value of -0.999 instead of 1, they just 
calculated something different (autocorrelation).


Indeed, that was autocorrelation. My fault. But the test case is still a 
beautiful one for CORRELATION. See also my other 2 test cases. Calc 
fails quite badly, and this should be fixed by 2.05!!! IF the 
autocorrelation should be -0.999, then the simpler correlation between x 
and x+10 should be at least as accurate, which was definitely not the 
case. (By the way, does OOo have an autocorrelation function? I did not 
find one.)


I performed the tests in gnumeric, too, and it seems gnumeric is 
accurate. (Well, be aware that if you open the .ods files, it will NOT 
recalculate the formulas.)


I'm not sure what you're trying to achieve here. A single look at the 
source would show you how CORREL is implemented using sums and square 
sums, which limits the values for which it works. There's no need for 
more example guesswork.


Mind if I ask, in which file is the CORREL function defined?
Tests are still useful to see where the algorithm fails (and, 
presumably, the new algorithm should be correct). Extensive test cases 
are always welcome.


I also suggest that one of the developers sit down for half an hour and 
write a TEXT file containing:

- the statistical function name
- the source file where it is defined

It would be of great help for others who do not know the OOo source code 
well and would not know where to look.


Kind regards,

Leonard Mada




Re: [sc-dev] Statistical Functions Implementation

2006-11-10 Thread Niklas Nebel

Leonard Mada wrote:
I entered in column B(1): "=A1 + 1", that is, I added the value "1" to 
every element in column A. Thus, the data values in columns A and B 
should be totally correlated. The NIST accepts a value of -0.999, but:


They don't "accept" a value of -0.999 instead of 1, they just calculated 
something different (autocorrelation).



- Calc gives only a value of -0.882.
- R gives an exact value of +1 (it reports positive values), both with 
the cor(x,y) function and with the cor.test(x, y) function


Not even the slightest deviation from this value. This should 
definitely be improved in Calc.


NOTE: sorting the values in Calc breaks CORREL() [it gives a #VALUE 
error, NO idea why]. Well, if I pair the non-ordered data in A with an 
ordered (A+1), I see a very strange result: CORREL() gives -1.01, BUT 
this coefficient CAN BE only between -1 and +1!!! SOME SERIOUS ERROR.


I will try to make more tests during the weekend.


I'm not sure what you're trying to achieve here. A single look at the 
source would show you how CORREL is implemented using sums and square 
sums, which limits the values for which it works. There's no need for 
more example guesswork.
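The sum/square-sum formula in question can be sketched in isolation. pearsonNaive is a hypothetical name; the body is the standard one-pass identity r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2)), which is how the limitation shows up.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// The naive one-pass formula built from raw sums and sums of squares.
// The two subtractions in the denominator cancel catastrophically
// when the data carries a large common offset.
double pearsonNaive(const std::vector<double>& x, const std::vector<double>& y)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    std::size_t n = x.size();
    for (std::size_t j = 0; j < n; ++j) {
        sx  += x[j];
        sxx += x[j] * x[j];
        sy  += y[j];
        syy += y[j] * y[j];
        sxy += x[j] * y[j];
    }
    return (n * sxy - sx * sy)
         / std::sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}
```

On small, well-conditioned data this agrees with the exact result, but on values around 10^8 or larger the true denominator terms (here only 6) drown in the rounding error of the ~10^16-sized squares, so the result can drift far from the correct value or even become NaN, which is consistent with the failures reported above.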


Niklas
