[R] Behavior of self-defined function within ddply

2014-01-15 Thread Amitabh Dugar
I have a data frame small which has 5,000 rows and contains data for several 
tickers every month, as below:

  

    monthend_n ticker wgtdiff    ret interval      b1       b2       b3      b4      b5    b6
  1   19990228     AA  0.7172  -2.58  0.33896 -0.5868 -0.24784  0.09112 0.43008 0.76904 1.108
  2   19990228   AAPL -0.0828 -15.48  0.33896 -0.5868 -0.24784  0.09112 0.43008 0.76904 1.108
  3   19990228   ABCW  0.0966  -7.36  0.33896 -0.5868 -0.24784  0.09112 0.43008 0.76904 1.108
...
705   19990331     AA  0.1932    1.7  0.31602 -0.7641 -0.44808 -0.13206 0.18396 0.49998 0.816
706   19990331   AAPL   0.033   3.23  0.31602 -0.7641 -0.44808 -0.13206 0.18396 0.49998 0.816
707   19990331    ABF   0.154 -20.51  0.31602 -0.7641 -0.44808 -0.13206 0.18396 0.49998 0.816
708   19990331    ABI   0.286   8.33  0.31602 -0.7641 -0.44808 -0.13206 0.18396 0.49998 0.816
etc.
Variables b1 through b6 are break points that I want to use in the cut 
function and they vary each month according to the distribution of the variable 
wgtdiff  during that month. 

To handle this I wrote a function as below:
cutfunc <- function(df)
{
  vec <- df$wgtdiff
  # need to apply unique function as break points within each month are same
  # for all tickers (b1-b6 values are the same within each month)
  breaks <- c(unique(df$b1), unique(df$b2), unique(df$b3), unique(df$b4),
              unique(df$b5), unique(df$b6))
  bin <- cut(vec, breaks, labels=FALSE)
  bin
}
Then I tried:
temp4 <- ddply(small, .(monthend_n), summarize, bins=cutfunc(small))
I was expecting to get back a data frame with 5,000 rows with bin assignments 
for each ticker, and if there are 6 break points the bin numbers should range 
from 1 to 5.
However, instead I get a data frame with 40,000 rows and bin numbers ranging 
from 1 to 40, as below:
  monthend_n bins
1   19990228   40
2   19990228   17
3   19990228   22
...
5000   19990228   17
5001   19990331   40
5002   19990331   17
5003   19990331   22

etc

It seems ddply doesn't pass the monthly pieces of the data frame small into my 
cutfunc in the way I expect.

Any guidance is appreciated.
Thanks
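
A note on the likely cause, offered as a sketch rather than a definitive 
answer: inside summarize, the expression cutfunc(small) refers to the whole 
data frame small, not to the monthly piece, so every month would receive all 
5,000 bins (which would give 40,000 rows for 8 months) and the break points of 
all months would be pooled. The base R sketch below applies a per-piece 
function to one month at a time. The toy data and its break points are made up 
for illustration; only the column names come from the post.

```r
# Toy stand-in for `small`: two months, four tickers each (made-up values).
small <- data.frame(
  monthend_n = rep(c(19990228, 19990331), each = 4),
  ticker     = c("AA", "AAPL", "ABCW", "ABI", "AA", "AAPL", "ABF", "ABI"),
  wgtdiff    = c(0.70, -0.08, 0.10, 0.24, 0.19, 0.03, 0.15, 0.29),
  b1 = rep(c(-0.6, -0.8), each = 4),
  b2 = rep(c(-0.2, -0.4), each = 4),
  b3 = rep(c( 0.0,  0.0), each = 4),
  b4 = rep(c( 0.2,  0.1), each = 4),
  b5 = rep(c( 0.5,  0.2), each = 4),
  b6 = rep(c( 1.2,  0.9), each = 4)
)

# Bin one monthly piece using that month's own break points.
# The first row is enough because b1..b6 are constant within a month.
cutfunc <- function(df) {
  breaks <- unlist(df[1, paste0("b", 1:6)], use.names = FALSE)
  cut(df$wgtdiff, breaks, labels = FALSE)
}

# Split by month, bin each piece separately, then stack the results.
pieces <- split(small, small$monthend_n)
binned <- lapply(pieces, function(piece) {
  data.frame(monthend_n = piece$monthend_n,
             ticker     = piece$ticker,
             bins       = cutfunc(piece))
})
temp4 <- do.call(rbind, binned)

nrow(temp4)        # same number of rows as `small`
range(temp4$bins)  # bins stay within 1..5 for 6 break points
```

With plyr loaded, the same per-piece function could be handed to ddply in 
place of summarize, e.g. ddply(small, .(monthend_n), function(piece) ...), 
since ddply passes each monthly piece to the function it is given.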

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] How to use ddply

2014-01-13 Thread Amitabh Dugar
I have never used R-help to pose a question to the R-users community; is 
sending this Email the right way to do so?

I am trying to use the ddply function in the plyr package to accomplish the 
following:
I have a data frame of the type:

     ticker monthend_n wgtdiff    ret
156      AA   19990228  0.7172  -2.58
545    AAPL   19990228 -0.0828 -15.48
925    ABCW   19990228  0.0966  -7.36
1041   ABFS   19990228  0.1320  -8.89
1165    ABI   19990228  0.2355   4.61
1482    ABS   19990228  0.1668  -6.56
1563    ABT   19990228  0.1650  -0.27
1790   ACAT   19990228  0.1540 -13.82
2498    ACN   19990228  0.  12.15
2532    ACO   19990228  0.1320   8.48
2857    ACV   19990228  0.1540  -6.54
2942   ACXM   19990228  0.  -6.13
3303   ADCT   19990228  0.1035   1.73
3568    ADM   19990228  0.1540   0.33
4072   ADSK   19990228 -0.1035  -9.19
4672    AEH   19990228  0.1650     NA
4673   AEIC   19990228  0.1314  -6.95
4867    AEP   19990228  0.1540  -3.62
157      AA   19990331  0.1932   1.70
546    AAPL   19990331  0.0330   3.23
1005    ABF   19990331  0.1540 -20.51
1166    ABI   19990331  0.2860   8.33
1255    ABK   19990331  0.0966  -3.57
1483    ABS   19990331  0.  -4.50
1564    ABT   19990331  0.3955   1.08
1733    ABX   19990331  0.2340  -3.53
2533    ACO   19990331  0.0966   5.26
3304   ADCT   19990331  0.2925  17.75
3418    ADI   19990331  0.2688  18.70
3724    ADP   19990331  0.1540 -38.43
4514    AEE   19990331  0.1540  -1.31
4868    AEP   19990331 -0.0966  -4.65

I am trying to generate quintile cutoff points across the distribution of 
tickers for every month, using the command:
result <- ddply(test, .(monthend_n), .fun=cut, test$wgtdiff,5)

I get the message:
Error in cut.default(piece, ...) : 'x' must be numeric

I tried creating a monthly list of data frames, extracting the wgtdiff column 
and passing that into the cut function, but that did not work either (as below)
pieces <- split(test,test$monthend_n)
vectors <- lapply(pieces,"[[","wgtdiff")
quintiles <- lapply(vectors,cut(vectors[1:2],5))
Error in cut.default(vectors[1:2], 5) : 'x' must be numeric

However, the cut function does the job correctly when I pass it only an 
individual month's data, as below:
first <- pieces[[1]]
quintiles <- cut(first$wgtdiff,5)
levels(quintiles)

What is the correct way to solve this problem?

Thanks for your help, everyone!
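
One possible approach, as a sketch only. Both errors above likely arise 
because cut receives something that is not a numeric vector: in the ddply 
call, .fun=cut is handed each monthly piece (a data frame) as x, and in the 
lapply call, cut(vectors[1:2],5) is evaluated straight away on a list instead 
of being passed as a function. Passing cut itself to lapply lets it run once 
per month. Note that cut(x, 5) produces five equal-width intervals; if true 
quintile cutoffs are wanted, quantile() may be closer to the intent. The data 
below are the first five rows of each month from the post.

```r
# First five rows of each month, copied from the data frame shown in the post.
test <- data.frame(
  ticker     = c("AA", "AAPL", "ABCW", "ABFS", "ABI",
                 "AA", "AAPL", "ABF", "ABI", "ABK"),
  monthend_n = rep(c(19990228, 19990331), each = 5),
  wgtdiff    = c(0.7172, -0.0828, 0.0966, 0.1320, 0.2355,
                 0.1932,  0.0330, 0.1540, 0.2860, 0.0966)
)

# One numeric vector of wgtdiff per month.
pieces  <- split(test, test$monthend_n)
vectors <- lapply(pieces, `[[`, "wgtdiff")

# Pass the function `cut` (not a call to it); lapply calls it once per month.
quintiles <- lapply(vectors, cut, breaks = 5)

# Alternative: quantile-based cutoffs (0%, 20%, ..., 100%) for each month.
cutoffs <- lapply(vectors, quantile, probs = seq(0, 1, by = 0.2), na.rm = TRUE)

lapply(quintiles, levels)  # interval labels per month
cutoffs                    # quintile break points per month
```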

