[R] Reasons to Use R

2007-04-11 Thread Alan Zaslavsky
Right: SAS objects (at least in the base and statistics components of the 
system -- there are dozens of add-ons for particular markets) are simple 
databases.  The predominant model for data manipulation and statistical 
calculation is a row-by-row operation that creates modified rows and/or 
accumulates totals.  This was pretty much the only way things could be 
done in the days when real (and typically virtual) memory was much smaller 
than it is now.  It can be a pretty efficient model for calculations that 
fit that pattern.  One downside, of course, is that a line of R code can 
easily turn into 30 lines of SAS with data steps, sort steps, steps to 
accumulate totals, etc.
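To make the contrast concrete, here is the kind of one-liner I mean (an 
illustrative sketch only; dat, grp, and y are made-up names), a grouped 
summary that in base SAS would typically take a PROC SORT plus a BY-group 
DATA step, or a separate summary procedure:

dat <- data.frame(grp = rep(c("a", "b", "c"), each = 4), y = rnorm(12))
tapply(dat$y, dat$grp, mean)   # group means in a single vectorized expression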

As noted by a couple of previous writers, S-Plus might be regarded as 
somewhat intermediate in its model, in that objects are stored as files but 
rows do not correspond to chunks of adjacent bytes in memory or filespace.

I have thought for a long time that a facility for efficient rowwise 
calculations might be a valuable enhancement to S/R.  The storage of the 
object would be handled by a database, and there would have to be an 
efficient interface for pulling a row (or small chunk of rows) out of the 
database repeatedly; alternatively, the operations could be conducted inside
the database.  Basic operations of rowwise calculation and accumulation
(such as forming a column sum or a sum of outer products) would be
written in an R-like syntax and translated into an efficient set of
operations that work through the database.  (I would be happy to share
some jejune notes on this.)  However, the main answer to this problem
in the R world seems to have been Moore's Law.  Perhaps somebody could
tell us more about the S-Plus large objects library, or the work that
Doug Bates is doing on efficient calculations with large datasets.
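
Here is a rough sketch of the kind of facility I have in mind, using an 
ordinary text file read through a connection as a stand-in for the database 
cursor (the function name, file layout, and chunk size are invented for 
illustration, and the file is assumed to contain only numeric columns):

chunked.accumulate <- function(filename, chunk.size = 10000) {
  con <- file(filename, open = "r")
  on.exit(close(con))
  col.sums <- NULL
  xtx <- NULL
  n <- 0
  repeat {
    # pull the next block of rows; stop when the connection is exhausted
    block <- try(read.table(con, nrows = chunk.size), silent = TRUE)
    if (inherits(block, "try-error") || nrow(block) == 0) break
    x <- as.matrix(block)
    if (is.null(col.sums)) {
      col.sums <- colSums(x)   # initialize accumulators on the first chunk
      xtx <- crossprod(x)      # running sum of outer products, t(x) %*% x
    } else {
      col.sums <- col.sums + colSums(x)
      xtx <- xtx + crossprod(x)
    }
    n <- n + nrow(x)
  }
  list(n = n, col.sums = col.sums, crossprod = xtx)
}

A real facility would of course talk to the database directly and translate 
the R-like expressions into such chunked loops automatically.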

    Alan Zaslavsky
[EMAIL PROTECTED]

> Date: Tue, 10 Apr 2007 16:27:50 -0600
> From: "Greg Snow" <[EMAIL PROTECTED]>
> Subject: Re: [R] Reasons to Use R
> To: "Wensui Liu" <[EMAIL PROTECTED]>
>
> I think SAS has the database part built into it.  I have heard secondhand
> of new statisticians going to work for a company, asking whether it has
> SAS, and getting the reply, "Yes, we use SAS for our database -- does it do
> statistics also?"  I have also heard that SAS is no longer considered an
> acronym; the company prefers it to be just a name and doesn't want the
> fact that one of the S's used to stand for statistics to scare away
> companies that use it as a database.
>
> Maybe someone more up on SAS can confirm or deny this.



Re: [R] Reasons to Use R

2007-04-11 Thread Alan Zaslavsky
Thanks, I will take a look.



[R] erratic behavior of match()?

2007-04-19 Thread Alan Zaslavsky
>Is this a consequence of machine error or something else?
>Could this be overcome? (It works correctly when integers are used in
>the sequences as well as in many other circumstances)

The usual solution for testing a==b with floating-point round-off error is 
abs(a-b) < eps for some suitably small eps; rounding both vectors before 
matching works in the same spirit:

> X1=seq(0,1,len=11)
> X2=seq(0,1,len=101)
> match(X1,X2)
  [1]   1  11  21  NA  41  51  NA  71  81  91 101
> match(round(X1,2),round(X2,2))
  [1]   1  11  21  31  41  51  61  71  81  91 101

In the following case the values of X2 do not round off "exactly", but 
7-digit accuracy seems fine.

> X2=seq(0,1,len=31)
> match(X1,X2)
  [1]  1  4  7 NA 13 16 NA NA 25 28 31
> match(round(X1,7),round(X2,7))
  [1]  1  4  7 10 13 16 19 22 25 28 31
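
An alternative to rounding is to match with an explicit tolerance.  A quick 
sketch (match.approx and tol are names I made up, not base R functions):

> match.approx <- function(x, table, tol = 1e-8)
+   sapply(x, function(xi) {
+     j <- which(abs(xi - table) < tol)   # entries of table within tol of xi
+     if (length(j)) j[1] else NA         # first hit, NA if none
+   })
> match.approx(X1, X2)
  [1]  1  4  7 10 13 16 19 22 25 28 31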



[R] random sampling with some limitive conditions?

2007-07-08 Thread Alan Zaslavsky
If I understand your problem, this might be a solution.  Assign independent 
random numbers to the row labels and to the column labels, and use the 
corresponding orderings to assign the row and column indices.  That way the 
row and column assignments are independent while the row and column totals 
are fixed.  If rr and cc are respectively the desired row and column totals, 
with sum(cc)==sum(rr), then

n = sum(cc)                                          # total number of 1's to place
row.assign = rep(1:length(rr),rr)[order(runif(n))]   # row labels in random order
col.assign = rep(1:length(cc),cc)[order(runif(n))]   # column labels, independently permuted

If you want many such sets of random assignments to be generated at once, 
you can use a few more rep() calls in the expressions to generate multiple 
sets in the same way.  (Do you actually want the assignments or just the 
tables?)  Of course there are many other possible solutions, since you have 
not fully specified the distribution you want.
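
If it is the randomized tables rather than the assignments that you want, 
here is one way to tabulate a single draw (a sketch with made-up margins, 
not your data):

rr = c(3, 1, 2)                       # desired row totals
cc = c(2, 2, 1, 1)                    # desired column totals; sum(cc) == sum(rr)
n = sum(cc)
row.assign = rep(1:length(rr), rr)[order(runif(n))]
col.assign = rep(1:length(cc), cc)[order(runif(n))]
tab = table(factor(row.assign, levels = 1:length(rr)),
            factor(col.assign, levels = 1:length(cc)))   # count table with the required margins
rowSums(tab)                          # reproduces rr
colSums(tab)                          # reproduces cc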

Alan Zaslavsky
Harvard U

> From: "Zhang Jian" <[EMAIL PROTECTED]>
> Subject: [R] random sampling with some limitive conditions?
> To: r-help 
> 
> I want to generate thousands of random samples by randomizing
> presence-absence data.  Meanwhile, one important limitation is that the row
> and column sums must stay fixed.  For example, the data "tst" are as follows:
> site1 site2 site3 site4 site5 site6 site7 site8
>     1     0     0     0     1     1     0     0
>     0     1     1     1     0     1     0     1
>     1     0     0     0     1     0     1     0
>     0     0     0     1     0     1     0     1
>     1     0     1     0     0     0     0     0
>     0     1     0     1     1     1     1     1
>     1     0     0     0     0     0     0     0
>     0     0     0     1     0     1     0     1
> 
> sum(tst[1,]) = 3, sum(tst[,1]) = 4, and so on.  When I randomize the data,
> the first row sum must still equal 3, and the first column sum must still
> equal 4.  The same rule applies to every row and column.
> How can I get the new random sampling data?  I have no idea.
> Thanks.
