Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-31 Thread Thomas Lumley
On Sat, 29 Jul 2006, Kevin B. Hendricks wrote:

 Hi Bill,

sum : igroupSums

 Okay, after thinking about this ...

 # assumes i is the small integer factor with n levels
 # v is some long vector
 # no sorting required

 igroupSums <- function(v, i) {
     sums <- rep(0, max(i))
     for (j in 1:length(v)) {
         sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
     }
     sums
 }

 If written in Fortran or C, this might be faster than using split().  It is
 at least just linear in time with the length of vector v.

For sums you should look at rowsum().  It uses a hash table in C and, last
time I looked, was faster than using split().  It returns a one-row-per-group
matrix rather than a plain vector, but that would easily be fixed.
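
For illustration, a minimal sketch of the rowsum() route (x and i are
made-up names, mirroring the timings later in this thread):

# sketch: grouped sums via rowsum(), then dropped to a plain vector
x <- rnorm(2e6)
i <- rep(1:1e6, 2)                # integer group codes
s <- rowsum(x, i)                 # one row per group, computed in C
s <- s[, 1]                       # drop the one-column matrix to a vector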

The same approach would work for min, max, range, count, mean, but not for 
arbitrary functions.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-31 Thread Kevin B. Hendricks
Hi Thomas,

Here is a comparison of performance times from my own igroupSums  
versus using split and rowsum:

> x <- rnorm(2e6)
> i <- rep(1:1e6, 2)
>
> unix.time(suma <- unlist(lapply(split(x, i), sum)))
[1] 8.188 0.076 8.263 0.000 0.000
>
> names(suma) <- NULL
>
> unix.time(sumb <- igroupSums(x, i))
[1] 0.036 0.000 0.035 0.000 0.000
>
> all.equal(suma, sumb)
[1] TRUE
>
> unix.time(sumc <- rowsum(x, i))
[1] 0.744 0.000 0.742 0.000 0.000
>
> sumc <- sumc[, 1]
> names(sumc) <- NULL
> all.equal(suma, sumc)
[1] TRUE


So my implementation of igroupSums is faster and already handles NA.   
I also have implemented igroupMins, igroupMaxs, igroupAnys,  
igroupAlls, igroupCounts, igroupMeans, and igroupRanges.

The igroup functions I implemented do not handle weights yet but do  
handle NAs properly.

Assuming I clean them up, is anyone in the R developer group interested?

Or would you rather I instead extend the rowsum approach to create
rowcount, rowmax, rowmin, etc. using a hash-table approach?
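
For what it's worth, counts and means already fall out of rowsum() itself; a
quick sketch, with x and i as in the timings above:

# sketch: grouped counts and means via rowsum() alone (x, i as in the timings above)
counts <- rowsum(rep(1, length(x)), i)[, 1]
means  <- rowsum(x, i)[, 1] / counts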

All of these approaches simply use different ways to map group
codes to integers and then compute the functions the same way.
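
As a concrete (made-up) illustration of one such mapping:

# one possible way to map arbitrary group keys to small integer codes (illustrative)
key   <- c("2006-06/3", "2006-06/1", "2006-07/2", "2006-06/1")
codes <- match(key, sort(unique(key)))
codes   # 2 1 3 1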

Thanks,

Kevin



Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-30 Thread Kevin B. Hendricks
Hi Bill,

After playing with this some more and adding an implementation to
handle NAs in the data vector, I have run into the problem of what to
return when the only data values for a particular bin (or level) in
the data vector were NAs and the user selected na.rm=TRUE.

1. Should it return 0 for the count of that particular bin and NA for
that bin for all of the other functions?  If so, wouldn't it be
strange to return NA simply because there is no valid data left in
that bin once the user has asked for na.rm=TRUE?

2.  Or do I have to literally rebuild the final result vector,
removing all unused bins before returning the results?  Wouldn't that
cause problems, since not all of the levels from 1:ngroups would then
be returned for some variables but would be for others?

I personally prefer approach 1: if I give an igroup function my
groups and tell it to na.rm=TRUE from my data vector, I really want
all group levels returned, not just the ones that had valid data in
them.  If a particular group had no data, I would want the count for
that bin to be 0 and all of the other functions to return NA for
that bin.
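
As a sketch of what approach 1 would look like on made-up data (tapply()
stands in here for the igroup functions):

# approach 1 on toy data: every level kept; count 0 and NA where a bin held only NAs
x <- c(1, NA, NA, 4)
g <- factor(c(1, 2, 2, 3), levels = 1:3)
counts <- tapply(!is.na(x), g, sum)                   # 1 0 1
sums   <- tapply(x, g, function(v)
    if (all(is.na(v))) NA else sum(v, na.rm = TRUE))  # 1 NA 4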

Is that what you are returning in that case?

Also, do you always return Sums, Maxs, and Mins as numeric or do  
you sometimes return integer values if an integer data vector is  
passed in?

Are Counts always returned as integer or do you always set them  
to numeric or does that vary with the type of the data vector  
passed in?

Do you handle complex data vectors in a similar fashion (i.e. using
the modulus of each complex value for Maxs, Mins, etc.)?

Thanks,

Kevin



Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-29 Thread Kevin B. Hendricks
Hi Bill,

So you wrote one routine that can calculate any one of a variety
of stats and allows weights, is that right?  Can it return a data
frame of any subset of the requested stats as well (that is what I was
thinking of doing anyway)?

I think someone can easily calculate all of those things in one pass  
through the array and then allow the user to select which of the new  
columns of stats should be added to a data.frame that is returned to  
the user.

To test all of this, I simply wrote my own igroupSums and integrated  
it into r-devel based on the code in split.c.  I can easily modify it  
to handle the case of calculating  a variety of stats (even all at  
the same time if desired).  I do not deal with weights at all and  
ignored that for now.

Here is what your test case now shows on my machine with the latest R  
build
with my added igroupSums routine (added internally to R).

> x <- rnorm(2e6)
> i <- rep(1:1e6, 2)
> unix.time(Asums <- unlist(lapply(split(x, i), sum)))
[1] 8.940 0.112 9.053 0.000 0.000
> names(Asums) <- NULL

My version of igroupSums does not keep the names so I remove them to  
make the results comparable.

Here is my own internal function igroupSums:

> unix.time(Bsums <- igroupSums(x, i))
[1] 0.932 0.024 0.958 0.000 0.000
>
> all.equal(Asums, Bsums)
[1] TRUE


So the speed-up is quite significant (9.053 seconds vs 0.958 seconds).

I will next modify my code to handle any single one of maxs, mins,
sums, counts, anys, alls, means, prods, and ranges by user choice,
although I will leave weights unimplemented for now (I always get
mixed up thinking about weights and basic stats, and I never use
them, so ...).

In case others want to play around with this too, here is the R  
wrapper in igroupSums.R to put in src/library/base/R/


igroupSums <- function(x, f, drop = FALSE, ...) UseMethod("igroupSums")

igroupSums.default <- function(x, f, drop = FALSE, ...)
{
    if (length(list(...))) .NotYetUsed(deparse(...), error = FALSE)

    if (is.list(f)) f <- interaction(f, drop = drop)
    else if (drop || !is.factor(f)) # drop extraneous levels
        f <- factor(f)
    storage.mode(f) <- "integer"  # some factors have double
    if (is.null(attr(x, "class")))
        return(.Internal(igroupSums(x, f)))
    ## else
    r <- by(x, f, sum)
    r
}

igroupSums.data.frame <- function(x, f, drop = FALSE, ...)
    lapply(split(seq(length = nrow(x)), f, drop = drop, ...),
           function(ind) x[ind, , drop = FALSE])


And here is a very simple igroupSums.c to put in src/main/

It still needs a lot of work: it does not handle NAs in the vector x
yet, and it still needs to be modified into a general routine that can
handle any single function of counts, sums, maxs, mins, means, prods,
anys, alls, and ranges.


#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <Defn.h>

SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args, SEXP env)
{
    SEXP x, f, sums;
    int i, j, nobs, nlevs, nfac;

    checkArity(op, args);

    x = CAR(args);
    f = CADR(args);
    if (!isVector(x))
        errorcall(call, _("first argument must be a vector"));
    if (!isFactor(f))
        errorcall(call, _("second argument must be a factor"));
    nlevs = nlevels(f);
    nfac = LENGTH(CADR(args));
    nobs = LENGTH(CAR(args));
    if (nobs <= 0)
        return R_NilValue;
    if (nfac <= 0)
        errorcall(call, _("Group length is 0 but data length > 0"));
    if (nobs % nfac != 0)
        warningcall(call, _("data length is not a multiple of split variable"));

    PROTECT(sums = allocVector(TYPEOF(x), nlevs));
    switch (TYPEOF(x)) {
    case INTSXP:
        for (i = 0; i < nlevs; i++) INTEGER(sums)[i] = 0;
        break;
    case REALSXP:
        for (i = 0; i < nlevs; i++) REAL(sums)[i] = 0.0;
        break;
    default:
        UNIMPLEMENTED_TYPE("igroupSums", x);
    }
    for (i = 0; i < nobs; i++) {
        j = INTEGER(f)[i % nfac];
        if (j != NA_INTEGER) {
            j--;
            switch (TYPEOF(x)) {
            case INTSXP:
                INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
                break;
            case REALSXP:
                REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
                break;
            default:
                UNIMPLEMENTED_TYPE("igroupSums", x);
            }
        }
    }
    UNPROTECT(1);
    return sums;
}


If anyone is playing with this themselves, don't forget to update  
Internal.h and names.c to reflect the added routine before you make  
clean and then rebuild.

Once I finish, I will post my patches here.  If someone would then
like to modify them to implement weights, please let me know.

Even if these never get added to R I can keep them in my own tree and  
use them for my own work.

Thanks again for all of your hints and guidance.  This alone will  
speed up my R code greatly!

Kevin

 That is roughly what I did in C code for the Splus version.
 E.g., here is the 

Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-28 Thread Kevin B. Hendricks
Hi,

I was using my installed R, which is 2.3.1, for the first tests.  I
moved to the r-devel tree (I svn up and rebuild every day) for my by
tests to see if it would work any better.  I neglected to retest
merge with the devel version.

So it appears merge is already fixed and I just need to worry about  
by.

On Jul 28, 2006, at 3:06 AM, Brian D Ripley wrote:

 Which version of R are you looking at?   R-devel has

 o merge() works more efficiently when there are relatively few
   matches between the data frames (for example, for 1-1
   matching).  The order of the result is changed for 'sort = FALSE'.


Thanks,

Kevin



Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-28 Thread Gabor Grothendieck
There was a performance comparison of several moving average
approaches here:
http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html

The author of that message ultimately wrote the caTools R package
which contains some optimized versions.

Not sure if these results suggest anything of interest here but it
would be interesting if various base routines could be sped up
to the point that a simple idiom is competitive with the caTools versions.


On 7/28/06, Kevin B. Hendricks [EMAIL PROTECTED] wrote:
 Hi,

 I was using my installed R, which is 2.3.1, for the first tests.  I
 moved to the r-devel tree (I svn up and rebuild every day) for my by
 tests to see if it would work any better.  I neglected to retest
 merge with the devel version.

 So it appears merge is already fixed and I just need to worry about
 by.

 On Jul 28, 2006, at 3:06 AM, Brian D Ripley wrote:

  Which version of R are you looking at?   R-devel has
 
  o merge() works more efficiently when there are relatively few
matches between the data frames (for example, for 1-1
matching).  The order of the result is changed for 'sort = FALSE'.
 

 Thanks,

 Kevin





Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-28 Thread Martin Maechler
 Kevin == Kevin B Hendricks [EMAIL PROTECTED]
 on Fri, 28 Jul 2006 14:53:57 -0400 writes:

[.]

Kevin The idea is to somehow make functions that work well
Kevin over small subsequences of a much longer vector
Kevin without resorting to splitting the vector into many
Kevin smaller vectors.

Kevin In my particular case, the problem was that my data frame
Kevin had over 1 million lines and probably over 500,000
Kevin unique sort keys (i.e. think of it as an R factor with
Kevin over 500,000 levels).  The implementation of by
Kevin uses tapply, which in turn uses split.  So split
Kevin simply ate up all the time trying to create 500,000
Kevin vectors, each of short length 1, 2, or 3, and the
Kevin associated garbage collection.

Not that I have spent enough time thinking about this thread's
topic, but I have seen more than one case where using  tapply()
unnecessarily slowed down computations.
I don't remember the details, but know that in one case, replacing
tapply() by a few lines of code {one of which using lapply() IIRC},
sped up that computation by a factor (of 2 ? or more?).
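
A sketch of the kind of replacement meant here, on made-up data (the
details of the original case are not recorded):

# illustrative: tapply() versus a bare split()/lapply() over the same groups
x <- rnorm(1e5)
g <- sample(1e3, 1e5, replace = TRUE)
m1 <- tapply(x, g, mean)
m2 <- unlist(lapply(split(x, g), mean))
all.equal(as.vector(m1), unname(m2))   # TRUE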

I also vaguely remember that I thought about making tapply()
faster, but came to the conclusion it could not be
sped up quickly, because it works in a rather more general
context than it was used in that application (and maybe yours?).


Kevin A simple loop that walked the short sequence of
Kevin values (since the data frame was already sorted),
Kevin calculating what it needed, would work much faster
Kevin than splitting the original vector into so very many
Kevin smaller vectors (and the associated copying of data).

Kevin That problem is very similar problem to the
Kevin calculation of basic stats on a short moving window
Kevin over a very long vector.

 The author of that message ultimately wrote the caTools R
 package which contains some optimized versions.

Kevin I will look into that package and maybe use it for a
Kevin model for what I want to do.

Kevin Thanks,

Kevin Kevin




Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-28 Thread Kevin B. Hendricks
Hi Bill,

 Splus8.0 has something like what you are talking about
 that provides a fast way to compute
 sapply(split(xVector, integerGroupCode), summaryFunction)
 for some common summary functions.  The 'integerGroupCode'
 is typically the codes from a factor, but you could compute
 it in other ways.  It needs to be a small integer in
 the range 1:ngroups (like the 'bin' argument to tabulate).
 Like tabulate(), which is called from table(), these are
 meant to be called from other functions that can set up
 appropriate group codes.  E.g., groupSums or rowSums or
 fancier things could be based on this.

 They do not insist you sort the input in any way.  That
 would really only be useful for group medians and we haven't
 written that one yet.

The sort is also useful for recoding each group into subgroups based  
on some other numeric vector.  This is the problem I run into trying  
to build portfolios that can be used as benchmarks for long term  
stock returns.

Another issue I have is that recoding the long character strings that I
use as sort keys for accessing a subgroup of the data in the
data.frame into a set of small integers is not fast.  I can make a fast
implementation if the data is sorted by the key, but without the
sort, just converting my sort keys to the required small integer
codes would be expensive for very long vectors, since my small integer
codes would have to reflect the order of the data (i.e. be increasing
subportfolio numbers).

More specifically, I am now converting all of my SAS code to R code  
and the problem is I have lots of snippets of SAS that do the  
following ...

PROC SORT;
   BY MDSIZ FSIZ;

/* WRITE OUT THE MIN SIZE CUTOFF VALUES */
PROC UNIVARIATE NOPRINT;
VAR FSIZ;
BY MDSIZ;
OUTPUT OUT=TMPS1 MIN=XMIN;

where my sort key MDSIZ is a character string that is the  
concatenation of the month ending date MD and the size portfolio of a  
particular firm (SIZ) and I want to find the cutoff points (the mins)  
for each of the portfolios for every month end date across all traded  
firms.
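
A rough R equivalent of that snippet, assuming the data sit in a data
frame (called gd here, as elsewhere in this thread) with columns MDSIZ
and FSIZ:

# rough R translation of the SAS step above: the min of FSIZ within each MDSIZ group
tmps1 <- aggregate(list(XMIN = gd$FSIZ), list(MDSIZ = gd$MDSIZ), FUN = min)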



 The typical prototype is

 igroupSums
 function(x, group = NULL, na.rm = F, weights = NULL,
          ngroups = if (is.null(group)) 1 else max(as.integer(group), na.rm = T))

 and the currently supported summary functions are
mean : igroupMeans
sum : igroupSums
prod : igroupProds
min : igroupMins
max : igroupMaxs
range : igroupRanges
any : igroupAnys
all : igroupAlls

SAS is similar in that it also has a specific list of functions you
can request, including all of the basic stats from PROC UNIVARIATE,
higher-moment stuff (skewness, kurtosis), robust statistics, and even
statistical test results for each coded subgroup, and the nice thing
is that all combinations can be done with one call.

To do that SAS does require the presorting, but it runs really fast
even for super long vectors with lots of sort keys.

Similarly, the next snippet of code will take the file and resort it
by the portfolio key and then by the market-to-book ratio (MTB) for all
trading firms for all monthly periods since 1980.  It will then
split each size portfolio for each month ending date into 5 equal
portfolios based on market-to-book ratios (thus the need for the
sort).  SAS returns a coded integer vector PMTB (made up of 1 to 5,
with 1 for the smallest MTB and 5 for the largest MTB) repeated
for each subgroup of MDSIZ.  PMTB matches the original vector in
length and therefore fits right into the data frame.

/* SPLIT INTO Market to Book QUINTILES BY MDSIZ */
PROC SORT;
   BY MDSIZ MTB;
PROC RANK GROUPS=5 OUT=TMPS0;
VAR MTB;
RANKS PMTB;
BY MDSIZ;
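
A rough R sketch of the same step, again assuming a data frame gd with
columns MDSIZ and MTB (codes run 1 to 5, as described above):

# rough sketch of PROC RANK GROUPS=5 BY MDSIZ: quintile codes of MTB within each group
gd$PMTB <- ave(gd$MTB, gd$MDSIZ, FUN = function(v)
    cut(rank(v, ties.method = "first"), breaks = 5, labels = FALSE))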

The problem of assigning elements of a long data vector to portfolios
and subportfolios, based on the values of specific data columns that
must be calculated at each step and are not fixed or hardcoded, is one
that finance runs into (and therefore I run into it).

So by sorting I could handle the need for small integer recoding  
and the small integers would have meaning (i.e. higher values will  
represent larger MTB firms, etc).

That just leaves the problem of calculating stats on short sequences
of a longer vector.

 They are fast:

 x <- runif(2e6)
 i <- rep(1:1e6, 2)
 sys.time(sx <- igroupSums(x, i))
[1] 0.66 0.67
 length(sx)
[1] 1000000

 On that machine R takes 44 seconds to go the lapply/split
 route:

 unix.time(unlist(lapply(split(x,i), sum)))
[1] 43.24  0.78 44.11  0.00  0.00


Yes!  That is exactly what I need.

Are there plans for adding something like that to R?

Thanks,

Kevin



Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-28 Thread Bill Dunlap
On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

 Hi Bill,

  Splus8.0 has something like what you are talking about
  that provides a fast way to compute
  sapply(split(xVector, integerGroupCode), summaryFunction)
  for some common summary functions.  The 'integerGroupCode'
  is typically the codes from a factor, but you could compute
  it in other ways.  It needs to be a small integer in
  the range 1:ngroups (like the 'bin' argument to tabulate).
  Like tabulate(), which is called from table(), these are
  meant to be called from other functions that can set up
  appropriate group codes.  E.g., groupSums or rowSums or
  fancier things could be based on this.
 
  They do not insist you sort the input in any way.  That
  would really only be useful for group medians and we haven't
  written that one yet.

 The sort is also useful for recoding each group into subgroups based
 on some other numeric vector.  This is the problem I run into trying
 to build portfolios that can be used as benchmarks for long term
 stock returns.

 Another issue I have is that to recode a long character string that I
 use as a sort key for accessing a subgroup of the data in the
 data.frame to a set of small integers is not fast.  I can make a fast
 implementation if the data is sorted by the key, but without the
 sort, just converting my sort keys to the required small integer
 codes would be expensive for very long vectors since my small integer
 codes would have to reflect the order of the data (ie. be increasing
 subportfolio numbers).

True, but the underlying grouped summary code
shouldn't require you to do the sorting.  If
    codes <- match(char, sort(unique(char)))
is too slow then you could try sorting the
data set by the 'char' column and doing
    codes <- cumsum(char[-1] != char[-length(char)])
(loop over columns before doing cumsum if there
are several columns).
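
Spelled out on a toy sorted key (with a leading TRUE added so the codes
start at 1 and have the same length as the key):

# toy example of the cumsum() recoding on an already-sorted key
char  <- c("a", "a", "b", "b", "c")
codes <- cumsum(c(TRUE, char[-1] != char[-length(char)]))
codes   # 1 1 2 2 3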

If that isn't fast enough, then you could sort
in the C code as well, but I think there could
be lots of cases where that is slower.

I've used this code for out of core applications,
where I definitely do not want to sort the whole
dataset.

 More specifically, I am now converting all of my SAS code to R code
 and the problem is I have lots of snippets of SAS that do the
 following ...

 PROC SORT;
BY MDSIZ FSIZ;

 /* WRITE OUT THE MIN SIZE CUTOFF VALUES */
 PROC UNIVARIATE NOPRINT;
 VAR FSIZ;
 BY MDSIZ;
 OUTPUT OUT=TMPS1 MIN=XMIN;

 where my sort key MDSIZ is a character string that is the
 concatenation of the month ending date MD and the size portfolio of a
 particular firm (SIZ) and I want to find the cutoff points (the mins)
 for each of the portfolios for every month end date across all traded
 firms.


 
  The typical prototype is
 
  igroupSums
  function(x, group = NULL, na.rm = F, weights = NULL,
           ngroups = if (is.null(group)) 1 else max(as.integer(group), na.rm = T))
 
  and the currently supported summary functions are
 mean : igroupMeans
 sum : igroupSums
 prod : igroupProds
 min : igroupMins
 max : igroupMaxs
 range : igroupRanges
 any : igroupAnys
 all : igroupAlls

 SAS is similar in that is also has a specific list of functions you
 can request including all of the basic stats from a PROC univariate
 including higher moment stuff (skewness, kurtosis,  robust
 statistics, and even statistical test results for each coded
 subgroup, and the nice thing is all combinations can be done with one
 call.

 But to do that SAS does require the presorting, but it does run
 really fast for even super long vectors with lots of sort keys.

 Similarly the next snippet of code, will take the file and resort it
 by the portfolio key and then the market to book ratio (MTB) for all
 trading firms for all monthly periods since 1980.It will then
 split each size portfolio for each month ending date into 5 equal
 portfolios based on market to book ratios (thus the need for the
 sort).   SAS returns a coded integer vector PMTB (made up of 1s to 5
 with 1s's for the smallest MTB and 5 for the largest MTB) repeated
 for each subgroup of MDSIZ.  PMTB matches the original vector in
 length and therefore fits right into the data frame.

 /* SPLIT INTO Market to Book QUINTILES BY MDSIZ */
 PROC SORT;
BY MDSIZ MTB;
 PROC RANK GROUPS=5 OUT=TMPS0;
 VAR MTB;
 RANKS PMTB;
 BY MDSIZ;

 The problem of assigning elements of a long data vector to portfolios
 and sub portfolios based on the values of specific data columns which
 must be calculated at each step and are not fixed or hardcoded is one
 that finance can run into (and therefore I run into it).

 So by sorting I could handle the need for small integer recoding
 and the small integers would have meaning (i.e. higher values will
 represent larger MTB firms, etc).

 That just leaves the problem of calculating stats on short sequences
 of a longer vector.

  They are fast:
 
  x <- runif(2e6)
  i <- rep(1:1e6, 2)
  sys.time(sx <- 

Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-28 Thread Kevin B. Hendricks
Hi Bill,

sum : igroupSums

Okay, after thinking about this ...

# assumes i is the small integer factor with n levels
# v is some long vector
# no sorting required

igroupSums <- function(v, i) {
    sums <- rep(0, max(i))
    for (j in 1:length(v)) {
        sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
    }
    sums
}

If written in Fortran or C, this might be faster than using split().  It is
at least just linear in time with the length of vector v.  This
approach could easily be made parallel across t threads simply by picking
t starting points someplace along v and running this routine in
parallel on each piece.  You could even do it without thread locking
if the elements of sums can be accessed atomically, or by creating
multiple copies of sums (one for each piece) and then doing a final addition.
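
A rough sketch of that chunk-and-merge idea in plain R (written serially
here; each chunk's partial sums could in principle be computed in parallel):

# rough sketch of the chunk-and-merge idea; the per-chunk work is what would be parallelized
igroupSumsChunked <- function(v, i, nchunks = 2) {
    ngroups <- max(i)
    chunk   <- cut(seq(along = v), nchunks, labels = FALSE)
    partial <- lapply(split(seq(along = v), chunk), function(idx) {
        sums <- numeric(ngroups)            # one private copy of 'sums' per chunk
        for (j in idx) sums[i[j]] <- sums[i[j]] + v[j]
        sums
    })
    total <- numeric(ngroups)
    for (p in partial) total <- total + p   # the final addition of the per-chunk copies
    total
}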

I still think I am missing some obvious way to do this but ...

Am I thinking along the right lines?

Kevin



Re: [Rd] Any interest in merge and by implementations specifically for sorted data?

2006-07-27 Thread Seth Falcon
Kevin B. Hendricks [EMAIL PROTECTED] writes:
 My first R attempt was a simple

 # sort the data.frame gd and the sort key
 sorder <- order(MDPC)
 gd   <- gd[sorder, ]
 MDPC <- MDPC[sorder]
 attach(gd)

 # find the length and sum for each unique sort key
 XN   <- by(MVE, MDPC, length)
 XSUM <- by(MVE, MDPC, sum)
 GRPS <- levels(as.factor(MDPC))

 Well the ordering and sorting was reasonably fast but the first by  
 statement was still running 4 hours later on my machine (a dual 2.6  
 gig Opteron with 4 gig of main memory).  This same snippet of code in  
 SAS running on a slower machine takes about 5 minutes of system
 time.

I wonder if split() would be of use here.  Once you have sorted the
data frame gd and the sort keys MDPC, you could do:

gdList <- split(gd$MVE, MDPC)

xn <- sapply(gdList, length)
xsum <- sapply(gdList, sum)

+ seth
