Re: statistical computing

Radford Neal Tue, 22 Feb 2000 18:19:13 -0800
In article <[EMAIL PROTECTED]>,
Frank E Harrell Jr  <[EMAIL PROTECTED]> wrote:

> S-Plus is a powerful object-oriented language based on the S
> language (created at the same place that created the C language -
> ATT Bell Labs - C is for "computer" S for "statistics").

Usually, I find this sort of cheerleading just a bit annoying, but 
when given in response to a student's request for information about
what to learn, it gets downright misleading.

The truth:  S-Plus is not particularly powerful, and it's not
particularly well designed.  Its implementation by the people at AT&T
Bell Labs is downright incompetent.  Of course, it has lots of support
for statistical methods, making it useful for statisticians to know,
but it's a shame that this has had the effect of locking academic
statisticians into this indifferently-designed and badly-implemented
language, which is completely unusable for some purposes, such as
Markov chain Monte Carlo computations.  The free "R" look-alike for S
may fix at least the badly-implemented part of this dilemma.

Below is a posting I recently made to sci.stat.math, remarking on the
amazingly *worse* performance of the latest release of S-Plus - which
is up to SEVENTEEN times slower than the previous release.

Finally, I doubt very much that the "C" language stands for "computer".
What would its predecessor language, called "B", have stood for?

   Radford Neal

-------------------------------------------------------------------------

In article <[EMAIL PROTECTED]>,
Thomas Gatliffe  <[EMAIL PROTECTED]> wrote:

>R is a substitute but if you are really looking for the power of S-Plus you
>won't be satisfied.

This depends on what you mean by "power".  S-Plus has some facilities
that aren't in R, and some of the ones that R does have seem to be a
bit less mature - eg, I couldn't offhand figure out how to get R's lm
function to accept a "tolerance" parameter to allow it to accept
almost singular (but still doable) problems.  On the other hand, if
by "power" you mean the ability to do anything involving substantial
computation that isn't done by one of the built-in functions written
in C, then S-Plus is about the worst language you could possibly choose,
whereas R looks like it would be viable (though not especially fast).
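
On the lm tolerance point, here is a minimal sketch of one way it can
be done, assuming a current R release in which lm forwards extra
arguments through "..." to lm.fit, whose "tol" argument sets the QR
rank tolerance.  I have not checked this against R 0.90.1.  The data
and the names fit.default and fit.loose are made up for illustration.

```r
# Sketch only: assumes a current R release where lm() passes 'tol'
# through '...' to lm.fit() (default tol = 1e-7).  Data are invented.
x1 <- 1:10
z  <- c(1, -1, 2, -2, 1, -1, 2, -2, 1, -1)
x2 <- x1 + 1e-9 * z          # almost, but not exactly, collinear with x1
y  <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1)

fit.default <- lm(y ~ x1 + x2)               # default tolerance
fit.loose   <- lm(y ~ x1 + x2, tol = 1e-12)  # much smaller tolerance

# A coefficient reported as NA means the column was treated as aliased.
dropped.default <- is.na(coef(fit.default)[["x2"]])
dropped.loose   <- is.na(coef(fit.loose)[["x2"]])
print(c(default = dropped.default, loose = dropped.loose))
```

With the smaller tolerance the nearly collinear column is kept rather
than dropped, though the resulting coefficients may be numerically
unstable, so this is a way to force the fit, not a recommendation.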

What is most incredible is that the inefficiency of S-Plus has gotten
substantially worse in the latest release.  It's so bad that I wonder
if there is something peculiar about the version for our machine (SGI).
Here are the times (in seconds) to run five simple test programs that
are listed below:

    Program:    test1    test2     test3    test4    test5
 Iterations:      500     1000     10000     1000   100000

 S-PLUS 3.4:     18.3      9.0      38.4     31.0    232.9
 S-PLUS 5.1:     75.0    152.6     317.1    544.8    224.1
   R 0.90.1:     56.5      5.7      47.0     18.1     27.9

All results are on an SGI system with a 194 MHz R10000 processor.

Looking first at the OLD version of S-Plus (3.4) versus R, one sees
that on the first test, involving the built-in lm function, R is about
three times slower, perhaps due to the quality of the implementation
of that particular function.  On the other tests, R is about the same
speed as the old S-Plus, EXCEPT for test5, where the old S-Plus is
about eight times slower.  This test appears to activate the S-Plus
design flaw that causes some memory not to be recovered in loops.
Execution then gets slower and slower as the loop goes round (it takes
the old S-Plus only 5.3 seconds to do 10000 iterations, which would
lead you to think it would take 53 seconds, not 232, to do 100000).
Perusal of this test (below) will show you that it's not doing
anything that you might not want to do in lots of your programs.
Trying to avoid anything that might activate this bug while writing
programs is a ridiculous waste of your time.
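
One way to check for this kind of slowdown is to time the same loop at
several sizes and look at the cost per iteration, which should stay
roughly constant if nothing is accumulating.  The following is a
minimal sketch, assuming an interpreter that provides system.time (as
R does); loop.body and scaling.check are hypothetical helper names,
and the loop is the same as the one in test5.

```r
# Hypothetical helpers, for illustration only.
# loop.body repeats the test5 loop and returns the counter.
loop.body <- function (n)
{ k <- 0
  for (j in 1:n)
  { v <- c(1,2,j,j,3)
    if (v[2]==2)
    { k <- k+1
    }
  }
  k
}

# Time loop.body at each size and report elapsed seconds per iteration.
scaling.check <- function (sizes)
{ secs <- sapply(sizes, function (n) system.time(loop.body(n))[["elapsed"]])
  data.frame(n = sizes, seconds = secs, seconds.per.iteration = secs / sizes)
}

res <- scaling.check(c(1000, 10000, 100000))
print(res)
```

If seconds.per.iteration grows markedly with n, the loop is getting
slower as it goes round.  Timings this short are noisy, so repeat the
run before drawing conclusions.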

Turning now to the NEW S-Plus (5.1), one sees that the test involving
the built-in lm function is now about four times slower.  Not too
impressive an "upgrade".  But that's nothing compared to tests 2 to 4 -
they're up to 17 times slower!  And the last test shows that they
still haven't fixed the memory bug.
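
For anyone who wants to check the arithmetic, the ratios can be
recomputed from the table of times.  This is just a sketch; the matrix
simply re-enters the timings given above, and the variable names are
my own.

```r
# Timings in seconds, re-entered from the table (rows: system, cols: test).
times <- rbind("S-PLUS 3.4" = c(18.3,   9.0,  38.4,  31.0, 232.9),
               "S-PLUS 5.1" = c(75.0, 152.6, 317.1, 544.8, 224.1),
               "R 0.90.1"   = c(56.5,   5.7,  47.0,  18.1,  27.9))
colnames(times) <- paste0("test", 1:5)

# How many times slower S-PLUS 5.1 is than S-PLUS 3.4 on each test.
slowdown.51 <- times["S-PLUS 5.1", ] / times["S-PLUS 3.4", ]

# How many times slower (or faster, if < 1) R is than S-PLUS 3.4.
r.vs.34 <- times["R 0.90.1", ] / times["S-PLUS 3.4", ]

print(round(slowdown.51, 1))
print(round(r.vs.34, 1))
```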

This new release is so ridiculously bad that I wonder whether there's
something wrong with just the version for SGI machines.  I'd be
interested to find out what the results are on other machines.  Anyone
else with both the old and the new S-Plus want to try?

For comparison, the Maple equivalent of test5 (see below) takes 3.5
seconds to run on the same SGI machine as was used for the above
tests.  Clearly, R is not a particularly efficient interpreter either.
R does seem to avoid any truly spectacular inefficiencies, however.

Here are the programs, which don't do anything interesting or sensible
- I just typed in various things somewhat arbitrarily - but do do the
sort of things that one will want to do now and then.  They were run
for the numbers of iterations (n) shown above.  Times were from the
Unix "time" command, applied to an interactive run in which the
function was read with "source", then run, after which q() was
immediately done.  For R, the workspace was saved at the end to match
the behaviour of S in saving variables automatically.

test1:

time.test1 <- function (n)
{
  x1 <- c(1,4,1,3,2,3,4)
  x2 <- c(5,0,2,3,5,2,1)
   y <- c(4,3,2,4,3,2,1)

  for (i in 1:n)
  { ys <- sample(y)
    m <- lm(ys~x1+x2)
  }

  invisible()
}

test2:

time.test2 <- function (n)
{
  x1 <- c(1,4,1,3,2,3,4)
  x2 <- c(5,0,2,3,5,2,1)
   y <- c(4,3,2,4,3,2,1)

  k <- 0

  for (i in 1:n)
  { ys <- sample(y)
    if (cor(ys,x1)>cor(ys,x2))
    { k <- k+1
    }
  }

  invisible()
}

test3:

time.test3 <- function (n)
{
  x1 <- c(1,4,1,3,2,3,4)
  x2 <- c(5,0,2,3,5,2,1)
   y <- c(4,3,2,4,3,2,1)

  k <- 0

  for (i in 1:n)
  { for (j in 1:length(y))
    { y[j] <- x1[j] + x2[j]*y[j]
      if (y[j]<10 || y[j]>10)
      { y[j] <- 2.3
      }
    }
  }

  invisible()
}

test4:

time.test4 <- function (n)
{
  x1 <- c(1,4,1,3,2,3,4)
  x2 <- c(5,0,2,3,5,2,1)
   y <- c(0,0,0,0,0,0,0)

  k <- 0

  for (i in 1:n)
  { for (j in 1:length(y))
    { y[j] <- cor(x1+x2,x1-x2*j)
    }
    for (j in 1:length(x1))
    { x1[j] <- 0.1 - x1[j] 
    }
  }

  invisible()
}

test5:

time.test5 <- function (n)
{ 
  k <- 0
  for (j in 1:n)
  { v <- c(1,2,j,j,3)
    if (v[2]==2)
    { k <- k+1
    }
  }

  invisible()
}

Here's the Maple equivalent of test5:

test := proc(n)
   local j,k,v;
   k := 0;
   for j to n do
     v := [1,2,j,j,3]; 
     if v[2] = 2 then 
       k := k+1 
     fi
   od;
end;


----------------------------------------------------------------------------
Radford M. Neal                                       [EMAIL PROTECTED]
Dept. of Statistics and Dept. of Computer Science [EMAIL PROTECTED]
University of Toronto                     http://www.cs.utoronto.ca/~radford
----------------------------------------------------------------------------


===========================================================================
  This list is open to everyone. Occasionally, people lacking respect
  for other members of the list send messages that are inappropriate
  or unrelated to the list's discussion topics. Please just delete the
  offensive email.

  For information concerning the list, please see the following web page:
  http://jse.stat.ncsu.edu/
===========================================================================