Re: Project idea: Calc for Statistics

2012-12-14 Thread Rob Weir
On Fri, Dec 14, 2012 at 8:55 AM, Andrew Douglas Pitonyak
 wrote:
>
> On 12/06/2012 12:12 PM, Rob Weir wrote:
>>
>>
>> So two entirely different questions:
>>
>> 1) Improving the accuracy of the statistical (and other numerical
>> methods) we already have.
>>
>> 2) Extending the range of numerical methods we provide out-of-the-box
>
>
> My first thought when I read this was adding extended precision interval
> arithmetic; now that would be fun :-)
>
>
>>
>> I think #1 is a no-brainer, but it does require some expertise.  The
>> hard part is determining whether we have improved.  For most problems
>> we probably already get the same results as SPSS, R or other standard
>> statistical packages.  To really make an improvement we need to test
>> the edge cases, the "poorly conditioned" and more complex cases.
>>
>> For #2, it probably makes sense to define a bridge to R.   R is now
>> the standard and there are hundreds of libraries that extend the
>> environment.  You can call R routines from SAS or SPSS.  I just got
>> the new Mathematica 9 upgrade, and guess what?  They've now added the
>> ability to call R.   So some seamless way of calling R routines and
>> embedding R plots in Calc would be great.
>
>
> I considered upgrading Mathematica, but I am too busy to play around with it
> these days
>

I've played around a little.  It now has some built-in functions for
analyzing and graphing social networks, e.g., Facebook, Twitter.  Not
sure it is very useful, but perhaps a new software category of
"mathertainment"...

> Surprised that they integrate with R. Not because R is a bad thing, just
> something I had not expected because Mathematica already does so much out of
> the box. Provides instant access to their huge repository of extra stuff.
>

So Mathematica out-of-the-box likely has as much as R has
out-of-the-box.  But the free 3rd party packages on CRAN (over 5000 of
them) are a big win for R.  The ecosystem is as important as (or more
important than) the standalone app.  I think of it like Python -- IMHO the
language itself is undistinguished, but the existence of libraries for
every problem domain makes it my go-to tool for many problems.

I wonder if there is a lesson here?  Imagine we magically made our
templates and extensions repository 10x better (by some metric, not
necessarily size).  Or what about content repositories, e.g., clip
art, form letters, etc.  Our value proposition then becomes more about
the strength of the ecosystem and less about basic editing features.

-Rob

> --
> Andrew Pitonyak
> My Macro Document: http://www.pitonyak.org/AndrewMacro.odt
> Info:  http://www.pitonyak.org/oo.php
>


Re: Project idea: Calc for Statistics

2012-12-14 Thread Andrew Douglas Pitonyak


On 12/06/2012 12:12 PM, Rob Weir wrote:


> So two entirely different questions:
>
> 1) Improving the accuracy of the statistical (and other numerical
> methods) we already have.
>
> 2) Extending the range of numerical methods we provide out-of-the-box


My first thought when I read this was adding extended precision interval 
arithmetic; now that would be fun :-)




> I think #1 is a no-brainer, but it does require some expertise.  The
> hard part is determining whether we have improved.  For most problems
> we probably already get the same results as SPSS, R or other standard
> statistical packages.  To really make an improvement we need to test
> the edge cases, the "poorly conditioned" and more complex cases.
>
> For #2, it probably makes sense to define a bridge to R.   R is now
> the standard and there are hundreds of libraries that extend the
> environment.  You can call R routines from SAS or SPSS.  I just got
> the new Mathematica 9 upgrade, and guess what?  They've now added the
> ability to call R.   So some seamless way of calling R routines and
> embedding R plots in Calc would be great.


I considered upgrading Mathematica, but I am too busy to play around 
with it these days


Surprised that they integrate with R. Not because R is a bad thing, just
something I had not expected because Mathematica already does so much
out of the box. Provides instant access to their huge repository of
extra stuff.


--
Andrew Pitonyak
My Macro Document: http://www.pitonyak.org/AndrewMacro.odt
Info:  http://www.pitonyak.org/oo.php



Re: Project idea: Calc for Statistics

2012-12-06 Thread Pedro Giffuni
Hi Regina;

> From: Regina Henschel
> 
>Hi Pedro,
>
>Pedro Giffuni wrote:
>> Hi guys;
>> 
>> FWIW, while I was playing with the new random number generator I went
>> around looking for some references and I found this paper from the Journal
>> of Statistical Software (2010) titled "On the Numerical Accuracy of
>> Spreadsheets":
>> 
>> http://www.jstatsoft.org/v34/i04/paper
>> 
>> 
>> It basically shows that Calc, among other Spreadsheet programs, is not
>> really well suited for statistical analysis.
>
>They use an old version of Calc. In the meantime Calc has got a lot of
>accuracy improvements. And the new implementations in Excel 2010 are far more
>accurate than the old ones. The specific results of the paper are outdated. Of
>course the general problem of using spreadsheets for data exploration remains.


That's refreshing to know, thank you! The article linked by Tsutomu is somewhat
more up to date and indeed mentions that Excel has been working hard on that
field too.

The list towards the end of your message is very interesting too. I will have a
look too .. when I find time.

>
>
>> 
>> Something rather amazing is that the major statistics suites have been moving
>> towards a more "spreadsheet-like" environment. I am personally a fan of
>> Minitab as it brings many functions that I needed for quality control in a
>> previous job. The price of the software package sky-rocketed in a few years
>> though :(.
>
>I'm not familiar with specialized statistical software. One problem with Calc is
>that users do not know how to use the functions in Calc for their purpose, for
>example performing an ANOVA. So providing wizards would be helpful.

Hmm .. I haven't looked at how Excel does Anova. We surely have the tools to
do Anova but people do expect to see it as a handy script somewhere. The
statistical packages out there are not very different in that sense and in many
ways they emulate Excel.

>
>> 
>> One approach could be improving our local functions to match more
>> demanding specifications: some of that will necessarily have to be done.
>> Another approach could be facilitating interactions with software like R,
>
>https://issues.apache.org/ooo/show_bug.cgi?id=66589
>

Yes, as I said that approach has many followers (Hi Rob :) ). Working on
one approach doesn't mean we forget the others.

>> 
>> and I am aware that approach has many followers. A third approach, which
>> I would like to suggest as a future project, would be developing a scaddin
>> focused on statistics and making full use of the functions from boost that
>> we already have available as a module but we are not using to their full
>> extent.
>
>I know that Calc is really inaccurate in some corner cases and a comparison
>with the solutions from boost would be good. One problem is that Calc is
>limited to double precision because of the MSVC compiler. As far as I know,
>boost uses its own types to get better precision.
>

I am really hesitant to depend on the math functions in boost for the base Calc
because most users don't need such stuff and keeping up to date with Boost
can be painful. It's also rather nice to have our own implementations of the
basic functions.

With the boost stuff we get better performance and precision but we still have
to add the same high level functions/scripts for things like Anova. It would be
fine to use boost in scaddins, I think, and that would leave us a lot of space
for experimentation without interfering with the basic Calc.

This is all wishful thinking though, I doubt I will have the time for this soon.

>>
>> I know we are all busy with other stuff to improve for 4.0 Release, just
>> thought I'd leave the idea for the future.
>
>I did a lot of work on statistical functions under the mentorship of Eike in
>the past, but now I'm more interested in Draw.
>

Yes I noticed :). FWIW, my favorite drawing utility is Xara, which was copylefted
some time ago but never picked up many followers :(. Armin's work is absolutely
cool though.

>Some problems, which need to be solved are:
>- Adapt FDIST, FINV,  and TDIST to ODF
>- New algorithm needed in ScInterpreter::GetBetaDist, see "FIXME" there
>- Better detection of singular matrices
>- Change the LINEST function to check for collinearity (Excel compatibility)
>

Thanks for this shortlist

Pedro.


Re: Project idea: Calc for Statistics

2012-12-06 Thread Regina Henschel

Hi Pedro,

Pedro Giffuni wrote:

> Hi guys;
>
> FWIW, while I was playing with the new random number generator I went
> around looking for some references and I found this paper from the Journal
> of Statistical Software (2010) titled "On the Numerical Accuracy of
> Spreadsheets":
>
> http://www.jstatsoft.org/v34/i04/paper
>
>
> It basically shows that Calc, among other Spreadsheet programs, is not
> really well suited for statistical analysis.


They use an old version of Calc. In the meantime Calc has got a lot of
accuracy improvements. And the new implementations in Excel 2010 are far
more accurate than the old ones. The specific results of the paper are
outdated. Of course the general problem of using spreadsheets for data
exploration remains.




> Something rather amazing is that the major statistics suites have been moving
> towards a more "spreadsheet-like" environment. I am personally a fan of
> Minitab as it brings many functions that I needed for quality control in a
> previous job. The price of the software package sky-rocketed in a few years
> though :(.


I'm not familiar with specialized statistical software. One problem with
Calc is that users do not know how to use the functions in Calc for their
purpose, for example performing an ANOVA. So providing wizards would be
helpful.
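
As a rough illustration of what such a wizard would have to compute behind the scenes, here is a minimal one-way ANOVA sketch in Python. The function name `one_way_anova` and the sample data are made up for this example and are not part of Calc. The F statistic it returns would still have to be converted to a p-value with an F-distribution function, which is left out here.

```python
def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and F for a one-way ANOVA.

    `groups` is a list of lists of numbers, one inner list per treatment.
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical data: three treatments with three observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result["df_between"], result["df_within"])
print(round(result["F"], 6))
```

For this toy data the between-group sum of squares is 26 and the within-group sum of squares is 6, which gives F = 13 on 2 and 6 degrees of freedom.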




> One approach could be improving our local functions to match more
> demanding specifications: some of that will necessarily have to be done.
> Another approach could be facilitating interactions with software like R,


https://issues.apache.org/ooo/show_bug.cgi?id=66589



> and I am aware that approach has many followers. A third approach, which
> I would like to suggest as a future project, would be developing a scaddin
> focused on statistics and making full use of the functions from boost that
> we already have available as a module but we are not using to their full
> extent.


I know that Calc is really inaccurate in some corner cases and a
comparison with the solutions from boost would be good. One problem is
that Calc is limited to double precision because of the MSVC compiler.
As far as I know, boost uses its own types to get better precision.
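
A small sketch of the double-precision limit mentioned here, written in Python rather than C++. Python's `decimal` module is used only as a stand-in for an extended-precision type. It is not the boost API, and the helper names and the three input values are chosen purely for illustration.

```python
from decimal import Decimal, localcontext


def sum_in_double(values):
    """Plain left-to-right summation in IEEE 754 double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_extended(values, digits=50):
    """The same summation carried out with `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        total = Decimal(0)
        for v in values:
            total += Decimal(v)  # Decimal(float) converts the double exactly
        return total


values = [1e16, 1.0, -1e16]
print(sum_in_double(values))    # 0.0, the 1.0 is absorbed because 1e16 + 1 is not representable
print(sum_in_extended(values))  # 1
```

The explicit loop is deliberate: recent Python versions use compensated summation inside the built-in `sum()` for floats, which would hide the effect.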




> I know we are all busy with other stuff to improve for 4.0 Release, just
> thought I'd leave the idea for the future.


I did a lot of work on statistical functions under the mentorship of Eike
in the past, but now I'm more interested in Draw.


Some problems, which need to be solved are:
- Adapt FDIST, FINV,  and TDIST to ODF
- New algorithm needed in ScInterpreter::GetBetaDist, see "FIXME" there
- Better detection of singular matrices
- Change the LINEST function to check for collinearity (Excel compatibility)
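
To give an idea of what the last two items involve, here is a minimal Python sketch of a rank check based on Gaussian elimination with partial pivoting and a relative tolerance. It is not how Calc or Excel implement LINEST. The function names and the tolerance of `1e-10` are assumptions for this example, and a QR or SVD based test would be the more robust choice in practice.

```python
def numerical_rank(matrix, rel_tol=1e-10):
    """Estimate the rank of a matrix given as a list of rows."""
    rows = [[float(v) for v in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    # Pivots are judged relative to the largest entry, not against exact zero.
    scale = max(abs(v) for row in rows for v in row) or 1.0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= rel_tol * scale:
            continue  # no usable pivot: this column depends on earlier ones
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


def has_collinear_columns(design):
    """True if the columns of the design matrix are (numerically) dependent."""
    return numerical_rank(design) < len(design[0])


# Second column is exactly twice the first, so the regression is degenerate.
collinear = [[1, 2, 1], [2, 4, 1], [3, 6, 1], [4, 8, 1]]
# x, x squared and an intercept column are independent.
independent = [[1, 1, 1], [2, 4, 1], [3, 9, 1], [4, 16, 1]]
print(has_collinear_columns(collinear))    # True
print(has_collinear_columns(independent))  # False
```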

Kind regards
Regina







Re: Project idea: Calc for Statistics

2012-12-06 Thread Rob Weir
On Thu, Dec 6, 2012 at 10:57 AM, Pedro Giffuni  wrote:
> Hi guys;
>
> FWIW, while I was playing with the new random number generator I went
> around looking for some references and I found this paper from the Journal
> of Statistical Software (2010) titled "On the Numerical Accuracy of
> Spreadsheets":
>
> http://www.jstatsoft.org/v34/i04/paper
>

Two other relevant papers:

http://arc.nucapt.northwestern.edu/~karnesky/sdarticle.pdf

http://www.csdassn.org/software_reports/gnumeric.pdf


>
> It basically shows that Calc, among other Spreadsheet programs, is not
> really well suited for statistical analysis.
>
> Something rather amazing is that the major statistics suites have been moving
> towards a more "spreadsheet-like" environment. I am personally a fan of
> Minitab as it brings many functions that I needed for quality control in a
> previous job. The price of the software package sky-rocketed in a few years
> though :(.
>
> One approach could be improving our local functions to match more
> demanding specifications: some of that will necessarily have to be done.
> Another approach could be facilitating interactions with software like R,
> and I am aware that approach has many followers. A third approach, which
> I would like to suggest as a future project, would be developing a scaddin
> focused on statistics and making full use of the functions from boost that
> we already have available as a module but we are not using to their full
> extent.
>

So two entirely different questions:

1) Improving the accuracy of the statistical (and other numerical
methods) we already have.

2) Extending the range of numerical methods we provide out-of-the-box

I think #1 is a no-brainer, but it does require some expertise.  The
hard part is determining whether we have improved.  For most problems
we probably already get the same results as SPSS, R or other standard
statistical packages.  To really make an improvement we need to test
the edge cases, the "poorly conditioned" and more complex cases.
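
A minimal sketch of what such an edge-case test could look like, using sample variance as the example. The data set is made up: four small values shifted by 1e9, so the spread is tiny compared with the magnitude. The "textbook" one-pass formula is compared against Welford's update. Neither function is taken from Calc's source, and the names are assumptions for this illustration.

```python
def naive_variance(xs):
    """One-pass textbook formula: (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def welford_variance(xs):
    """Welford's running update, which avoids subtracting two huge numbers."""
    mean = 0.0
    m2 = 0.0
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)


well_conditioned = [4.0, 7.0, 13.0, 16.0]
ill_conditioned = [1e9 + x for x in well_conditioned]  # same spread, huge offset

print(naive_variance(well_conditioned))    # 30.0
print(welford_variance(well_conditioned))  # 30.0
print(naive_variance(ill_conditioned))     # squares near 1e18 cannot hold the low digits
print(welford_variance(ill_conditioned))   # 30.0
```

Both data sets have a true sample variance of 30, so any deviation on the shifted data is purely a numerical artifact. A test suite along these lines would compare each function against known reference values rather than against another spreadsheet.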

For #2, it probably makes sense to define a bridge to R.   R is now
the standard and there are hundreds of libraries that extend the
environment.  You can call R routines from SAS or SPSS.  I just got
the new Mathematica 9 upgrade, and guess what?  They've now added the
ability to call R.   So some seamless way of calling R routines and
embedding R plots in Calc would be great.
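
The simplest possible form of such a bridge is to shell out to R's command-line front end. The sketch below assumes that `Rscript` is installed and on the PATH, and the helper names are invented for this example. A real Calc integration would more likely go through an extension or a persistent R session, and this says nothing about how plots would be embedded.

```python
import shutil
import subprocess


def build_rscript_command(expression):
    """Command line that asks Rscript to evaluate a single R expression."""
    return ["Rscript", "-e", expression]


def call_r(expression, timeout=30):
    """Evaluate `expression` in R and return its printed output as text.

    Returns None when Rscript is not available on this machine.
    """
    if shutil.which("Rscript") is None:
        return None
    completed = subprocess.run(
        build_rscript_command(expression),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return completed.stdout.strip()


# cat() prints the bare value without R's "[1]" index prefix.
print(call_r("cat(mean(c(1, 2, 3)))"))
```

Starting a new R process per call is slow, so this only shows the shape of the data flow: values go out as an R expression and come back as text to be parsed into cells.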

-Rob

> I know we are all busy with other stuff to improve for 4.0 Release, just
> thought I'd leave the idea for the future.
>
> cheers,
>
> Pedro.


Re: Project idea: Calc for Statistics

2012-12-06 Thread Tsutomu Uchino
Hi,

I found the following paper several weeks ago in [1] (written in Japanese) that
describes the paper that Pedro mentioned.

"On the accuracy of statistical procedures in Microsoft Excel 2010",
Submitted but rejected, January 2012
http://homepages.ulb.ac.be/~gmelard/Recherche.htm

And I have seen that some people want to use tools such as the data analysis
tools provided in Excel.

The Apache Commons Math library provides statistical tools
written in Java.
It would also be good material for building an analysis tool as an extension, if
someone wanted to. I started to make such a thing, but it is discontinued.

[1] http://oku.edu.mie-u.ac.jp/~okumura/blog/node/2585

Tsutomu

2012/12/7, Pedro Giffuni :
> Hi guys;
>
> FWIW, while I was playing with the new random number generator I went
> around looking for some references and I found this paper from the Journal
> of Statistical Software (2010) titled "On the Numerical Accuracy of
> Spreadsheets":
>
> http://www.jstatsoft.org/v34/i04/paper
>
>
> It basically shows that Calc, among other Spreadsheet programs, is not
> really well suited for statistical analysis.
>
> Something rather amazing is that the major statistics suites have been moving
> towards a more "spreadsheet-like" environment. I am personally a fan of
> Minitab as it brings many functions that I needed for quality control in a
> previous job. The price of the software package sky-rocketed in a few years
> though :(.
>
> One approach could be improving our local functions to match more
> demanding specifications: some of that will necessarily have to be done.
> Another approach could be facilitating interactions with software like R,
> and I am aware that approach has many followers. A third approach, which
> I would like to suggest as a future project, would be developing a scaddin
> focused on statistics and making full use of the functions from boost that
> we already have available as a module but we are not using to their full
> extent.
>
> I know we are all busy with other stuff to improve for 4.0 Release, just
> thought I'd leave the idea for the future.
>
> cheers,
>
> Pedro.


Project idea: Calc for Statistics

2012-12-06 Thread Pedro Giffuni
Hi guys;

FWIW, while I was playing with the new random number generator I went
around looking for some references and I found this paper from the Journal
of Statistical Software (2010) titled "On the Numerical Accuracy of 
Spreadsheets":

http://www.jstatsoft.org/v34/i04/paper


It basically shows that Calc, among other Spreadsheet programs, is not
really well suited for statistical analysis.

Something rather amazing is that the major statistics suites have been moving
towards a more "spreadsheet-like" environment. I am personally a fan of
Minitab as it brings many functions that I needed for quality control in a
previous job. The price of the software package sky-rocketed in a few years
though :(.

One approach could be improving our local functions to match more
demanding specifications: some of that will necessarily have to be done.
Another approach could be facilitating interactions with software like R,
and I am aware that approach has many followers. A third approach, which
I would like to suggest as a future project, would be developing a scaddin
focused on statistics and making full use of the functions from boost that
we already have available as a module but we are not using to their full
extent.

I know we are all busy with other stuff to improve for 4.0 Release, just
thought I'd leave the idea for the future.

cheers,

Pedro.