Re: [sc-dev] Statistical Functions Implementation
Hi everyone,

I would like to discuss the following points: 1) I need help compiling, 2) ANOVA.

2) ANOVA

I will start with point 2. (Should I open a new thread for this, by the way?) I have appended a draft implementation of the *ANOVA function*, please see the code. There are still some unfinished issues where I need help:
- the ANOVA code proper is finished
- depending on how the variable arrays are implemented, it may become necessary to keep track of array lengths; if they are implemented as vectors, there is no such need
- some issues remain at function initialization, like retrieving the number of parameters and the like
- I will describe in a later post my ideas on how to implement the output

ANOVA is probably one of the last functions that could be implemented directly in Calc (the Fisher exact test would be another one). All other advanced statistical functions would best be implemented through external software (e.g. R).

1) COMPILING

Niklas Nebel wrote:
> OOo can now be built on Windows using only free tools. Noel just posted
> something at http://noelpower.blogs.ie/2006/11/10/hurray-now-you-can-build-with-free-compiler-on-windows/.
> I haven't tried it myself, though.

I have no idea how this works; I need help. As I previously mentioned, I am not an IT professional, it just happens that I know some C++. If anybody can compile this, I would be very thankful. It really corrects a bug inside Calc.
Kind regards,
Leonard Mada

    void ScInterpreter::ScANOVA()
    {
        // WE GET EITHER A SINGLE MATRIX WHERE EVERY COLUMN IS A SEPARATE VARIABLE
        //   DISADVANTAGE: ONLY ONE COLUMN PER VARIABLE
        // OR MULTIPLE MATRICES, EACH MATRIX IS ONE VARIABLE
        //   DISADVANTAGE: CALC FUNCTIONS ACCEPT ONLY 30 PARAMS,
        //   SO THERE ARE AT MOST 30 VARIABLES
        SCSIZE iVarNr = /* NUMBER OF PARAMETERS */;  // NUMBER OF VARIABLES
        if ( iVarNr == 0 /* NO PARAMETERS */ )
            return;  // EXIT
        if ( iVarNr == 1 /* ONLY ONE PARAMETER */ )
        {
            ScMatrixRef pMat = GetMatrix();
            if (!pMat)
            {   // NO DATA MATRIX - INVALID PARAMETERS
                SetIllegalParameter();
                return;
            }
            SCSIZE nC, nR;
            // WE HAVE ONLY ONE MATRIX;
            // WE CONSIDER EVERY COLUMN AS A SEPARATE DATA SET
            pMat->GetDimensions(nC, nR);
            iVarNr = nC;
            nC = 1;  // NOT REALLY NEEDED
        }

        ScMatrixRef pMat[iVarNr];
        for (size_t i = 0; i < iVarNr; i++)
            pMat[i] = GetMatrix();

        SCSIZE nC, nR, N = 0, jCount = 0, iCount;
        for (iCount = 0; iCount < iVarNr; iCount++)
        {
            pMat[iCount]->GetDimensions(nC, nR);
            fSumX[iCount] = 0.0;  // INITIALIZE THE SUM
            for (SCSIZE i = 0; i < nC; i++)
                for (SCSIZE j = 0; j < nR; j++)
                {
                    if (!pMat[iCount]->IsString(i,j))
                    {
                        fValX[iCount][jCount] = pMat[iCount]->GetDouble(i,j);
                        fSumX[iCount] += fValX[iCount][jCount];
                        jCount++;
                    }
                }
            fSumM += fSumX[iCount];
            fSumX[iCount] = fSumX[iCount] / jCount;  // THIS IS THE MEAN
            N += jCount;
            jCount = 0;  // RESET jCount FOR NEXT VARIABLE
        }  // END OUTER FOR LOOP

        if (iCount < 2)
            SetNoValue();
        else
        {
            dfB = iCount - 1;
            dfE = N - iCount;
            fSumM = fSumM / N;  // THIS IS THE GRAND MEAN
            for (SCSIZE i = 0; i < iCount; i++)
            {
                for (SCSIZE j = 0; j < /* INDIVIDUAL GROUP SIZE */; j++)
                {
                    // GROUPS MAY HAVE DIFFERENT SIZES
                    fMSE += (fValX[i][j] - fSumX[i]) * (fValX[i][j] - fSumX[i]);
                    // fMSB += (fSumM - fSumX[i]) * (fSumM - fSumX[i]);
                    // TO AVOID MORE COMPUTATIONS WE CAN CALCULATE fMSB OUTSIDE THIS LOOP
                }
                fMSB += /* INDIVIDUAL GROUP SIZE */ * (fSumM - fSumX[i]) * (fSumM - fSumX[i]);
            }
            fMSB = fMSB / dfB;
            fMSE = fMSE / dfE;
            PushDouble( fMSB / fMSE );
            // WE STILL NEED TO INTERPRET fMSB/fMSE USING THE F STATISTICS
        }
    }

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
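As a self-contained sketch of the computation the draft above is aiming at, the same one-way ANOVA F statistic can be written over plain `std::vector`s (function and variable names here are illustrative only; `ScMatrix` and the interpreter plumbing are deliberately left out):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-way ANOVA F statistic over groups of possibly different sizes.
// F = MSB / MSE, with dfB = k - 1 and dfE = N - k.
double anovaF(const std::vector<std::vector<double>>& groups)
{
    std::size_t k = groups.size();  // number of groups (variables)
    std::size_t N = 0;              // total observation count
    double grand = 0.0;             // grand sum, divided down to grand mean

    std::vector<double> mean(k, 0.0);
    for (std::size_t g = 0; g < k; ++g) {
        for (double x : groups[g]) mean[g] += x;
        grand += mean[g];
        N += groups[g].size();
        mean[g] /= groups[g].size();  // group mean
    }
    grand /= N;                       // grand mean

    double ssb = 0.0, sse = 0.0;
    for (std::size_t g = 0; g < k; ++g) {
        double d = mean[g] - grand;
        ssb += groups[g].size() * d * d;             // between-group SS
        for (double x : groups[g])
            sse += (x - mean[g]) * (x - mean[g]);    // within-group SS
    }
    double msb = ssb / (k - 1);   // mean square between, dfB = k - 1
    double mse = sse / (N - k);   // mean square error,   dfE = N - k
    return msb / mse;
}
```

As the original post notes, the returned F value would still have to be interpreted against the F distribution to yield a p-value.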
Re: [sc-dev] Statistical Functions Implementation
Leonard Mada wrote:
> I wish to mention the following 3 issues:
> 1. compiling and testing the new algorithm
> 2. "void ScInterpreter::ScCorrel()" and "void ScInterpreter::ScPearson()"
>    seem to be identical
>    -- I have found no difference in the documentation for CORREL() and
>       PEARSON(), but I may have overlooked something
>    -- if they are truly identical, then calling ScCorrel() from within
>       ::ScPearson() would be OK
>    -- no need to maintain identical code twice

No, you're right. That should be unified. Well spotted.

> 3. other functions seem to be/are broken, like "void ScInterpreter::ScCovar()"
>    and "void ScInterpreter::ScRSQ()", ...
>    -- these would need to be fixed, too

Sure. Most of them are still the initial implementation. Only a few functions have ever been updated.

> Ok, then everything is fine. Can somebody compile this? I do not really
> know how to (or have the utilities to) compile it. Before integrating
> this into an official OOo, I believe it should be extensively tested.

OOo can now be built on Windows using only free tools. Noel just posted something at http://noelpower.blogs.ie/2006/11/10/hurray-now-you-can-build-with-free-compiler-on-windows/. I haven't tried it myself, though.

> Is it possible to compile only a dll, so that I only have to replace one
> file in my official OOo? (I have no idea if this file gets compiled as a
> separate dll, but that would be really helpful.)

Built-in functions are in the main Calc dll, sc680mi.dll on Windows.

Niklas
Re: [sc-dev] Statistical Functions Implementation
Hi Eike,

I wish to mention the following 3 issues:
1. compiling and testing the new algorithm
2. "void ScInterpreter::ScCorrel()" and "void ScInterpreter::ScPearson()" seem to be identical
   -- I have found no difference in the documentation for CORREL() and PEARSON(), but I may have overlooked something
   -- if they are truly identical, then calling ScCorrel() from within ::ScPearson() would be OK
   -- no need to maintain identical code twice
3. other functions seem to be/are broken, like "void ScInterpreter::ScCovar()" and "void ScInterpreter::ScRSQ()", ...
   -- these would need to be fixed, too

I wish to discuss only issue 1 briefly. Eike Rathke wrote:

> ... Noticed.
>
> > 3. the x and y values (fValX[count] and fValY[count]) must be stored, so
> > we have to define (variable) arrays. I do not know which method is best
> > suited / will affect speed less. I usually prefer vectors when dealing
> > with such a situation, but that might be too much for this one, especially
> > if we do not do any sorting. The size is unfortunately not known beforehand.
>
> A maximum size is known: there can't be more than nC1*nR1 elements, so
> pre-allocating new double[nC1*nR1] is fine.

Ok, then everything is fine. Can somebody compile this? I do not really know how to (or have the utilities to) compile it. Before integrating this into an official OOo, I believe it should be extensively tested. Is it possible to compile only a dll, so that I only have to replace one file in my official OOo? (I have no idea if this file gets compiled as a separate dll, but that would be really helpful.)

Although this algorithm is more accurate than the original naive algorithm, it needs 2 passes, so it is undoubtedly slower. Wikipedia describes a one-pass robust algorithm. While that would run slightly faster, it is (probably) less accurate. Also, even this algorithm may fail in special situations (hence the need for the test condition in the code). Unlike for the standard deviation, I have no solid mathematical background on how robust each implementation is.

Kind regards,
Leonard Mada
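The one-pass "robust" updating scheme mentioned above (the Welford-style algorithm described on Wikipedia) could be sketched as follows; this is an illustration of the general technique, not Calc code, and the function name is made up:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-pass correlation: means and centered (co)moments are updated
// incrementally, so the data is traversed only once.
double correlOnePass(const std::vector<double>& x, const std::vector<double>& y)
{
    double meanX = 0.0, meanY = 0.0;
    double m2x = 0.0, m2y = 0.0, cxy = 0.0;  // centered second moments
    for (std::size_t n = 0; n < x.size(); ++n) {
        double dx = x[n] - meanX;            // deviation from the OLD mean
        double dy = y[n] - meanY;
        meanX += dx / (n + 1);               // update running means
        meanY += dy / (n + 1);
        m2x += dx * (x[n] - meanX);          // old deviation * new deviation
        m2y += dy * (y[n] - meanY);
        cxy += dx * (y[n] - meanY);          // co-moment update
    }
    return cxy / std::sqrt(m2x * m2y);
}
```

This avoids the second pass at the cost of a division per element; as the post says, whether it is more or less accurate than the two-pass centered version depends on the data.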
Re: [sc-dev] Statistical Functions Implementation
Hi Leonard,

On Friday, 2006-11-10 18:32:23 +0200, Leonard Mada wrote:

> 1. I forgot a "count++;" in the first for LOOP
> [inside the: if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j)) {} body]
>
> 2. because this count would run from 1 to n, the 2nd for LOOP should be
> modified accordingly:
> for(j = 0; j < count; j++) { // NOT j <= count

Noticed.

> 3. the x and y values (fValX[count] and fValY[count]) must be stored, so
> we have to define (variable) arrays. I do not know which method is best
> suited / will affect speed less. I usually prefer vectors when dealing
> with such a situation, but that might be too much for this one, especially
> if we do not do any sorting. The size is unfortunately not known beforehand.

A maximum size is known: there can't be more than nC1*nR1 elements, so pre-allocating new double[nC1*nR1] is fine.

  Eike

--
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to the [EMAIL PROTECTED] account, which I use for
 mailing lists only and don't read from outside Sun. Thanks.
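Eike's point about the known maximum size can be sketched with `std::vector` instead of a raw `new double[nC1*nR1]` (the function and the NaN-as-string-cell convention below are my own stand-ins, not the ScMatrix API):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// The exact count of numeric cells is unknown beforehand (string cells
// are skipped), but it cannot exceed nC1 * nR1, so one up-front
// reservation suffices. Here NaN stands in for a string cell.
std::vector<double> collectNumeric(const std::vector<std::vector<double>>& mat)
{
    std::size_t nC1 = mat.size();
    std::size_t nR1 = nC1 ? mat[0].size() : 0;
    std::vector<double> vals;
    vals.reserve(nC1 * nR1);      // maximum possible size: no reallocation later
    for (const auto& col : mat)
        for (double v : col)
            if (!std::isnan(v))   // skip the "string" cells
                vals.push_back(v);
    return vals;                  // vals.size() is the actual count
}
```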
Re: [sc-dev] Statistical Functions Implementation
Hi Leonard,

On Friday, 2006-11-10 17:13:00 +0200, Leonard Mada wrote:

> unsigned int count = 0; // Counter for values
> // DO WE NEED AN ??? fCount ???

Nah.. there won't be more than 2^32 matrix elements..

> OR is (unsigned int) count OK

size_t respectively SCSIZE is preferred instead, for clear semantics.

  Eike

--
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to the [EMAIL PROTECTED] account, which I use for
 mailing lists only and don't read from outside Sun. Thanks.
Re: [sc-dev] Statistical Functions Implementation
A small correction to my previous algorithm:

1. I forgot a "count++;" in the first for LOOP
[inside the: if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j)) {} body]

2. because this count would run from 1 to n, the 2nd for LOOP should be modified accordingly:
for(j = 0; j < count; j++) { // NOT j <= count

3. the x and y values (fValX[count] and fValY[count]) must be stored, so we have to define (variable) arrays. I do not know which method is best suited / will affect speed less. I usually prefer vectors when dealing with such a situation, but that might be too much for this one, especially if we do not do any sorting. The size is unfortunately not known beforehand.

Kind regards,
Leonard Mada
Re: [sc-dev] Statistical Functions Implementation
Niklas Nebel wrote:
> sc/source/core/tool/interpr3.cxx, method ScInterpreter::ScCorrel.

Calc does indeed use the *naïve algorithm* to compute the correlation coefficient (see Wikipedia):

    if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j))
    {
        fValX = pMat1->GetDouble(i,j);
        fValY = pMat2->GetDouble(i,j);
        fSumX += fValX;
        fSumSqrX += fValX * fValX;
        fSumY += fValY;
        fSumSqrY += fValY * fValY;
        fSumXY += fValX * fValY;
        fCount++;
    }

The rewritten algorithm would look like:

    unsigned int count = 0; // Counter for values
    // DO WE NEED AN ??? fCount ??? OR is (unsigned int) count OK
    double fMeanX = 0.0;
    double fMeanY = 0.0;
    for (SCSIZE j = 0; j < nR1; j++)
    {
        if (!pMat1->IsString(i,j) && !pMat2->IsString(i,j))
        {
            fValX[count] = pMat1->GetDouble(i,j);
            fValY[count] = pMat2->GetDouble(i,j);
            fMeanX += fValX[count];
            fMeanY += fValY[count];
            // ALTERNATIVELY SORT FIRST X AND Y AND
            // CALCULATE THE MEANS IN A SEPARATE LOOP
            // FOR GREATER ACCURACY
        }
    }
    if (count < 2)
        SetNoValue();
    else
    {
        fMeanX = fMeanX / count;
        fMeanY = fMeanY / count;
        for (j = 0; j <= count; j++)
        {
            fSum += (fValX[j] - fMeanX) * (fValY[j] - fMeanY);
            fSDX += (fValX[j] - fMeanX) * (fValX[j] - fMeanX);
            fSDY += (fValY[j] - fMeanY) * (fValY[j] - fMeanY);
        }
        fVal = fSum / ( SQRT(fSDX * fSDY) );
        if ( (fVal >= -1.0) && (fVal <= 1.0) )
        {
            PushDouble( fSum / ( SQRT(fSDX * fSDY) ) );
        }
        else
        {
            // REPORT AN ERROR: INVALID VALUE
            // ALGORITHM FAILED
        }
    }
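Folding in the three corrections Leonard posted (the missing count++, j < count instead of j <= count, and storing the values in arrays), the two-pass algorithm can be sketched as a self-contained function; plain vectors replace ScMatrix, NaN replaces a string cell, and returning NaN stands in for SetNoValue(), so this compiles on its own:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-pass correlation: first pass accumulates the means while
// collecting valid pairs, second pass accumulates centered sums.
double correlTwoPass(const std::vector<double>& x, const std::vector<double>& y)
{
    std::vector<double> fValX, fValY;
    double fMeanX = 0.0, fMeanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!std::isnan(x[i]) && !std::isnan(y[i])) {   // skip "string" cells
            fValX.push_back(x[i]);
            fValY.push_back(y[i]);
            fMeanX += x[i];
            fMeanY += y[i];
        }
    }
    std::size_t count = fValX.size();
    if (count < 2)
        return NAN;        // stands in for SetNoValue()
    fMeanX /= count;
    fMeanY /= count;

    double fSum = 0.0, fSDX = 0.0, fSDY = 0.0;
    for (std::size_t j = 0; j < count; ++j) {           // j < count, not <=
        fSum += (fValX[j] - fMeanX) * (fValY[j] - fMeanY);
        fSDX += (fValX[j] - fMeanX) * (fValX[j] - fMeanX);
        fSDY += (fValY[j] - fMeanY) * (fValY[j] - fMeanY);
    }
    return fSum / std::sqrt(fSDX * fSDY);
}
```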
Re: [sc-dev] Statistical Functions Implementation
Leonard Mada wrote:
> Mind if I ask, in which file is the CORREL function defined? Tests are
> still useful to see where the algorithm fails (and purportedly, the new
> algorithm should be correct). Extensive test cases are always welcome.

sc/source/core/tool/interpr3.cxx, method ScInterpreter::ScCorrel.

> I also suggest that one of the developers sits down for half an hour and
> writes a TEXT file containing:
> - statistical function name
> - source file where it is defined
> It would be of great help for others who do not know the OOo source code
> well, to know where to look.

Find the function name in sc/source/core/src/compiler.src, the corresponding OpCode value in sc/inc/opcode.hxx, and look in ScInterpreter::Interpret (sc/source/core/tool/interpr4.cxx) to see which method is called for the OpCode. Add-In functions are separate, but that's how you can find any built-in function's implementation.

Niklas
Re: [sc-dev] Statistical Functions Implementation
Niklas Nebel wrote:
> They don't "accept" a value of -0.999 instead of 1, they just calculated
> something different (autocorrelation).

Indeed, that was autocorrelation; my fault. But the test case is still a beautiful one for CORRELATION. See also my other 2 test cases: Calc fails quite badly, and this should be fixed by 2.0.5! If the autocorrelation should be -0.999, then the simpler correlation between x and x+10 should be at least as accurate, which was definitely not the case. (By the way, does OOo have an autocorrelation function? I did not find one.)

I performed the tests in Gnumeric, too, and it seems Gnumeric is accurate. (Well, be aware that if you open the .ods files, it will NOT recalculate the formulas.)

> I'm not sure what you're trying to achieve here. A single look at the
> source would show you how CORREL is implemented using sums and square
> sums, which limits the values for which it works. There's no need for
> more example guesswork.

Mind if I ask, in which file is the CORREL function defined? Tests are still useful to see where the algorithm fails (and purportedly, the new algorithm should be correct). Extensive test cases are always welcome.

I also suggest that one of the developers sits down for half an hour and writes a TEXT file containing:
- statistical function name
- source file where it is defined
It would be of great help for others who do not know the OOo source code well, to know where to look.

Kind regards,
Leonard Mada
Re: [sc-dev] Statistical Functions Implementation
Leonard Mada wrote:
> I entered in column B(1): "=A1 + 1", that is, I added the value "1" to
> every element in column A. That said, the data values in columns A and B
> should be totally correlated. NIST accepts a value of -0.999, but:

They don't "accept" a value of -0.999 instead of 1, they just calculated something different (autocorrelation).

> - Calc gives only a value of -0.882.
> - R gives an absolutely precise value of +1 (it reports positive values),
>   both with the cor(x,y) function and with the cor.test(x,y) function.
>   Not even the slightest deviation from this value.
> This should definitely be improved in Calc.
> NOTE: sorting the values in Calc breaks CORREL() [it gives a #VALUE
> error, no idea why]. Well, if I pair the non-ordered data in A with an
> ordered (A+1), I see a very strange result: CORREL() gives -1.01, BUT
> this coefficient CAN only be between -1 and +1! SOME SERIOUS ERROR.
> I will try to make more tests during the weekend.

I'm not sure what you're trying to achieve here. A single look at the source would show you how CORREL is implemented using sums and square sums, which limits the values for which it works. There's no need for more example guesswork.

Niklas
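Niklas's point about sums and square sums can be demonstrated directly: with a large common offset in the data, the naive formula suffers catastrophic cancellation, while centering first does not. A small self-contained illustration (my own, not the Calc code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive one-pass correlation from raw sums and square sums, the same
// formula shape as the current ScInterpreter::ScCorrel. The subtraction
// n*sxx - sx*sx cancels catastrophically when values share a big offset.
double correlNaive(const std::vector<double>& x, const std::vector<double>& y)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i];          sy += y[i];
        sxx += x[i] * x[i];  syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    return (n * sxy - sx * sy) /
           std::sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}

// Two-pass correlation on centered values: the offset is removed
// before anything is squared, so no cancellation occurs.
double correlCentered(const std::vector<double>& x, const std::vector<double>& y)
{
    double mx = 0, my = 0;
    std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n;  my /= n;
    double sxy = 0, sxx = 0, syy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}
```

With x = 1e8 + i and y = x + 1 (Leonard's "column B = A + 1" test, shifted by a large offset), the true correlation is exactly +1: the centered version reports it, while the naive version loses precision in the squared sums and drifts, exactly the failure mode discussed in this thread.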