[R] problems with rollapply {zoo}
Here is a relatively simple script (with comments on the logic interspersed):

    # Some of these libraries are probably not needed here, but leaving them in place harms nothing:
    library(tseries)
    library(xts)
    library(quantmod)
    library(fGarch)
    library(fTrading)
    library(ggplot2)
    # Set the working directory, where the data file is located, and read the raw data
    setwd('C:/cygwin/home/Ted/New.Task/NKS-quotes/NKS-quotes')
    x = read.table("quotes_M11.dat", header = FALSE, sep = "\t", skip = 0)
    str(x)
    # Set up the date column
    dt <- sprintf("%s %04d", x$V2, x$V4)
    dt <- as.POSIXlt(dt, format = "%Y-%m-%d %H%M")
    # Prepare a frame that gets converted to an xts object
    y <- data.frame(dt, x$V5)
    colnames(y) <- c("tickdate", "price")
    # Make the xts object, and then the OHLC object (as an aside, the tick data
    # includes volume, but I have yet to figure out how to make an OHLC object
    # that includes volume)
    z <- xts(y[,2], y[,1])
    alpha <- to.minutes3(z, OHLC = TRUE)
    colnames(alpha) <- c("Open", "High", "Low", "Close")
    alpha$rel_t <- seq(1 - nrow(alpha), 0)
    # Just to check the code for the regression, apply the regression to the whole
    # series (unless the series is really short or has a strong slow pattern, the
    # regression result is not useful except to show that the code works)
    polyfit <- lm(Close ~ poly(rel_t, 4), alpha)
    polyfit2 <- lm(Close ~ rel_t + I(rel_t^2) + I(rel_t^3) + I(rel_t^4), data = alpha)
    # This is the objective, where all the magic happens
    rollRegFun <- function(d, i) {
      # set up the relative time variable, so that the current record has rt = 0
      d$rt <- seq(1 - nrow(d), 0)
      # apply the regression to fit a 4th degree polynomial in rt
      polyfit <- lm(Close ~ poly(rt, 4), d)
      # get the coefficients
      p <- coef(polyfit)
      # get the roots of the first derivative of the fitted polynomial
      pr <- polyroot(c(p[2], 2*p[3], 3*p[4], 4*p[5]))
      # define a function that evaluates the second derivative as a function of x
      dd <- function(x) { rv = 2*p[3] + 6*p[4]*x + 12*p[5]*x*x; rv; }
      # evaluate the second derivative at the ith root, and print the result
      r <- dd(pr[i])
      r
    }
    rollRegFun(alpha, 1)
    rollRegFun(alpha, 2)
    rollRegFun(alpha, 3)

The code I show above does not give an error, but if the function is rewritten as:

    rFun <- function(d) {
      d$rt <- seq(1 - nrow(d), 0)
      polyfit <- lm(Close ~ poly(rt, 4), d)
      p <- coef(polyfit)
      pr <- polyroot(c(p[2], 2*p[3], 3*p[4], 4*p[5]))
      dd <- function(x) { rv = 2*p[3] + 6*p[4]*x + 12*p[5]*x*x; rv; }
      r <- dd(pr[1])
      r
    }

and I try to get rollapply to execute it on a moving window, I get errors. E.g.

    rollapply(as.zoo(alpha), 60, rFun)
    Error in from:to : argument of length 0

Yet, the following works:

    rollapply(alpha$Close, 60, mean)

What do I have to do to either my function or my use of rollapply in order to get it to work?

Thanks

Ted

[[alternative HTML version deleted]]
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
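One plausible reading of the error above: by default rollapply works column by column (by.column = TRUE), so rFun receives a plain vector, nrow(d) is NULL, and seq() fails. The sketch below passes by.column = FALSE so that each call sees every column of the window. The series is synthetic, and the use of raw = TRUE in poly() is my own assumption (without it the coefficients belong to orthogonal polynomials, not to powers of rt), so treat this as an illustration and not as the answer given on the list.

```r
library(zoo)

set.seed(1)
n <- 100
# Synthetic stand-in for the OHLC series in the post (values are made up).
close <- 9200 + cumsum(rnorm(n))
z <- zoo(cbind(Open = close, High = close + 1, Low = close - 1, Close = close),
         order.by = seq_len(n))

# Window function: with by.column = FALSE it receives all columns of one window.
rFun <- function(d) {
  d <- as.data.frame(d)
  d$rt <- seq(1 - nrow(d), 0)            # relative time, current record at 0
  # raw = TRUE: coefficients of the ordinary polynomial in rt (assumption, see above)
  p <- unname(coef(lm(Close ~ poly(rt, 4, raw = TRUE), data = d)))
  # roots of the first derivative of the fitted polynomial
  pr <- polyroot(c(p[2], 2 * p[3], 3 * p[4], 4 * p[5]))
  # second derivative of the fitted polynomial
  dd <- function(x) 2 * p[3] + 6 * p[4] * x + 12 * p[5] * x^2
  Re(dd(pr[1]))                          # roots are complex; keep the real part
}

res <- rollapply(z, width = 60, FUN = rFun, by.column = FALSE)
```

With 100 observations and a width of 60, res holds one value per complete window.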
Re: [R] plotOHLC(alpha3): Error in plotOHLC(alpha3) : x is not a open/high/low/close time series
Hi Joshua,

Thanks. I had used irts because I thought I had to. The tick data I have has some minutes in which there is no data, and others when there are hundreds, or even thousands. If xts supports irregular data, then that is one less step for me to worry about.

Alas, your suggestion didn't help:

    z <- xts(y[,2], y[,1])
    alpha3 <- to.minutes3(z, OHLC = TRUE)
    plotOHLC(alpha3)
    Error in plotOHLC(alpha3) : x is not a open/high/low/close time series
    str(alpha3)
    An ‘xts’ object from 2010-06-30 15:47:00 to 2011-10-31 15:14:00 containing:
      Data: num [1:98865, 1:4] 9215 9220 9205 9195 9195 ...
     - attr(*, "dimnames")=List of 2
      ..$ : NULL
      ..$ : chr [1:4] "z.Open" "z.High" "z.Low" "z.Close"
      Indexed by objects of class: [POSIXct,POSIXt] TZ:
      xts Attributes:
     NULL

Is there anything else I might try?

Thanks again,

Ted

--
View this message in context: http://r.789695.n4.nabble.com/plotOHLC-alpha3-Error-in-plotOHLC-alpha3-x-is-not-a-open-high-low-close-time-series-tp4283217p4286124.html
Sent from the R help mailing list archive at Nabble.com.
Re: [R] plotOHLC(alpha3): Error in plotOHLC(alpha3) : x is not a open/high/low/close time series
Thanks Joshua,

That did it.

Cheers,

Ted

--
View this message in context: http://r.789695.n4.nabble.com/plotOHLC-alpha3-Error-in-plotOHLC-alpha3-x-is-not-a-open-high-low-close-time-series-tp4283217p4286963.html
Sent from the R help mailing list archive at Nabble.com.
[R] plotOHLC(alpha3): Error in plotOHLC(alpha3) : x is not a open/high/low/close time series
R version 2.12.0, 64 bit on Windows.

Here is a short script that illustrates the problem:

    library(tseries)
    library(xts)
    setwd('C:\\cygwin\\home\\Ted\\New.Task\\NKs-01-08-12\\NKs\\tests')
    x = read.table("quotes_h.2.dat", header = FALSE, sep = "\t", skip = 0)
    str(x)
    y <- data.frame(as.POSIXlt(paste(x$V2, substr(x$V4, 4, 8), sep = " "), format = '%Y-%m-%d %H:%M'), x$V5)
    colnames(y) <- c("tickdate", "price")
    str(y)
    plot(y)
    z <- as.irts(y)
    str(z)
    plot(z)
    str(alpha3)
    List of 2
     $ time : POSIXt[1:98865], format: 2010-06-30 15:47:00 2010-06-30 15:53:00 2010-06-30 17:36:00 ...
     $ value: num [1:98865, 1:4] 9215 9220 9205 9195 9195 ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr [1:4] "z.Open" "z.High" "z.Low" "z.Close"
     - attr(*, "class")= chr "ts"
     - attr(*, "tsp")= num [1:3] 1 2 1
    alpha3 <- as.xts(to.minutes3(z, OHLC = TRUE))
    plotOHLC(alpha3)
    Error in plotOHLC(alpha3) : x is not a open/high/low/close time series

The file quotes_h.2.dat contains real-time tick data for futures contracts, so the above manipulation is my attempt to get a time series with one column being a date/time and the other being the tick price. I believe I have to use read.table to make a data frame, and then apply the manipulations that combine the date and time fields from that feed with the price.

My first goal in using to.minutes3 (and I am interested in the other 'to.period' functions too) is to get a regular time series to which I can apply rollapply, along with a function in which I use various autoregression methods and forecast for as long as the 95% confidence interval is reasonably close. I want to know how far into the future the forecast contains useful information.

And then, I want to create a plot in which I do the autoregression and plot the actual and forecast prices (along with the confidence interval) as a function of time. I would embed that in a function that rollapply works with, so I can have a plot comprised of all those individual plots (plotting only the comparison of actual and forecast values).

It seems everything works adequately until I try the plotOHLC function itself, which gives me the error in the subject line.

I would ask for two things:

1) what the fix is to get rid of the error plotOHLC gives me
2) some tips on the 'walk-forward' method I am looking at using.

Thanks

Ted
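The fix that eventually worked is not recorded in this part of the thread, so the following is only a guess at one contributing factor: the bars come back with column names prefixed by the object name (z.Open, z.High, ...), whereas the error message asks for an open/high/low/close series. The sketch builds 3-minute bars from made-up irregular ticks with xts and renames the columns explicitly. The tick data, the time zone and the renaming step are all my assumptions.

```r
library(xts)

set.seed(42)
# Made-up irregular tick data: 500 ticks at random second offsets over two hours.
times <- as.POSIXct("2010-06-30 15:47:00", tz = "UTC") + sort(sample(0:7200, 500))
price <- 9215 + cumsum(sample(c(-5, 0, 5), 500, replace = TRUE))
z <- xts(price, order.by = times)

# Aggregate the ticks into 3-minute open/high/low/close bars.
bars <- to.minutes3(z, OHLC = TRUE)

# Replace the "z."-prefixed names with plain OHLC names.
colnames(bars) <- c("Open", "High", "Low", "Close")
```

Whether tseries::plotOHLC then accepts the object may also depend on its class, so a plotting function that takes xts input directly (quantmod's chartSeries, for example) could be the simpler route.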
[R] How do I get rid of list elements where the value is NULL before applying rbind?
Here is the function that makes the data.frames in the list:

    funweek <- function(df)
      if (length(df$elapsed_time) > 5) {
        res = fitdist(df$elapsed_time, "exp")
        year = df$sale_year[1]
        sample = df$sale_week[1]
        mid = df$m_id[1]
        estimate = res$estimate
        sd = res$sd
        samplesize = res$n
        loglik = res$loglik
        aic = res$aic
        bic = res$bic
        chisq = res$chisq
        chisqpvalue = res$chisqpvalue
        chisqdf = res$chisqdf
        if (!is.null(estimate) && !is.null(sd) && !is.null(loglik) && !is.null(aic) && !is.null(bic) && !is.null(chisq) && !is.null(chisqpvalue) && !is.null(chisqdf)) {
          rv = data.frame(mid, year, sample, samplesize, estimate, sd, loglik, aic, bic, chisq, chisqpvalue, chisqdf)
          rv
        }
      }

I use the following, with different data, successfully:

    z <- lapply(split(moreinfo, list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week), drop = TRUE), funweek)
    qqq <- z[, c('mid', 'year', 'sample', 'samplesize', 'estimate', 'sd', 'loglik', 'aic', 'bic', 'chisq', 'chisqpvalue', 'chisqdf')]
    ndf2 <- do.call(rbind, qqq)

However, I am now getting the following error:

    qqq <- z[, c('mid', 'year', 'sample', 'samplesize', 'estimate', 'sd', 'loglik', 'aic', 'bic', 'chisq', 'chisqpvalue', 'chisqdf')]
    Error in z[, c("mid", "year", "sample", "samplesize", "estimate", "sd", :
      incorrect number of dimensions

My suspicion is that this is because sometimes one or more of the elements in my conditional block is NULL, so nothing is returned, and this puts a NULL element into z.

Here is a selection of a few elements so you can see what is in 'z'.

    $`353.2010.0`
         mid year sample samplesize estimate sd loglik aic
    rate 353 2010 0 17 0.06463837 0.01567335 -63.5621 129.1242
         bic chisq chisqpvalue chisqdf
    rate 129.9574 14.90239 0.001901994 3

    $`355.2010.0`
    NULL

    $`376.2010.0`
         mid year sample samplesize estimate sd loglik aic
    rate 376 2010 0 6 0.07228863 0.02950606 -21.76253 45.52506
         bic chisq chisqpvalue chisqdf
    rate 45.31682 16.46848 4.946565e-05 1

You can see that the value for the element named `355.2010.0` is NULL, and it is my guess that this leads to the error I show above. But I can't confirm that yet, because I don't yet know how to get rid of elements that have a name but only NULL as the value.

I haven't seen this dealt with in the references I have read so far. I think I may be able to deal with it by creating dummy values for the fields the data frame requires and then using SQL to remove them, but I'd rather not have to resort to that if I can avoid it. I can't believe there isn't something in the base package for R that would easily handle this, but not knowing the name of the function to look at, I haven't found it yet.

Any information would be appreciated.

Thanks

Ted
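A minimal base-R sketch of dropping NULL elements from a list before rbind, using a made-up stand-in for 'z' with fewer columns than the real one. It also illustrates what I take to be a second issue: z is a list, so z[, cols] has no column dimension to index whether or not it contains NULLs, and the column selection is easier after the rows have been bound.

```r
# Made-up stand-in for 'z': a named list of one-row data frames and one NULL.
z <- list(
  `353.2010.0` = data.frame(mid = 353, year = 2010, sample = 0, estimate = 0.0646),
  `355.2010.0` = NULL,
  `376.2010.0` = data.frame(mid = 376, year = 2010, sample = 0, estimate = 0.0723)
)

# Keep only the non-NULL elements.
keep <- !vapply(z, is.null, logical(1))
z_ok <- z[keep]                 # equivalently: Filter(Negate(is.null), z)

# Bind the remaining data frames into one, then select columns by name.
ndf2 <- do.call(rbind, z_ok)
cols <- c("mid", "year", "estimate")
qqq  <- ndf2[, cols]
```

Selecting columns after the rbind works here because ndf2 is a data frame with the usual two dimensions.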
[R] One problem with RMySQL and a query that returns an empty recordset
My last query related to this referred to a problem with not being able to store data. A suggestion was made to try to convert the data returned by fitdist into a data.frame before using rbind. That failed, but provided the key to solving the problem (which was to create a data.frame using the variables fitdist produces in the object it returns).

I now have almost everything working as intended. However, there is one problem. Here is the error:

    'data.frame':   0 obs. of 0 variables
    Error in `[.data.frame`(moreinfo, , 1) : undefined columns selected
    Calls: [ -> [.data.frame
    Execution halted

The curious thing is that this happens when my script is called from within perl. Within Rgui, the script continues through to the end, but the loop that is involved terminates at the line where this error occurs.

The line that results in this error is:

    moreinfo <- dbGetQuery(con, x)

This statement occurs in a loop that ought to iterate over a few hundred values for m_id (see the SQL below). Because of the above error, I never see about two thirds of the results that ought to be produced.

At the time that the error occurs, x contains the following SQL query:

    SELECT m_id, sale_date, YEAR(sale_date) AS sale_year, MONTH(sale_date) AS sale_month, return_type, 0.0001 + DATEDIFF(return_date, sale_date) AS elapsed_time
    FROM `merchants2`.`risk_input`
    WHERE m_id = 361 AND return_type = 1 AND DATEDIFF(return_date, sale_date) IS NOT NULL;

If I execute this SQL, I find the resultset is empty. So assigning the value returned by dbGetQuery to moreinfo works ONLY if the resultset is not empty. It fails with a fatal error if the resultset is empty.

So, the question is, how can I revise that statement so that the assignment happens only if the resultset is NOT empty?

Thanks

Ted
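Reading the error text, the assignment itself appears to succeed (the first line looks like str() output for an empty data frame) and the failure comes from the later moreinfo[, 1] on a frame with no columns. Under that assumption, a guard on nrow() that skips the iteration is one way to handle it. In this sketch fake_query is a hypothetical stand-in for dbGetQuery(con, x), and the m_id values are made up.

```r
# Hypothetical stand-in for dbGetQuery(con, x): empty result for m_id 361.
fake_query <- function(m_id) {
  if (m_id == 361) {
    return(data.frame())
  }
  data.frame(m_id = m_id, elapsed_time = c(1.0001, 3.0001, 8.0001))
}

processed <- numeric(0)
for (m_id in c(360, 361, 362)) {
  moreinfo <- fake_query(m_id)
  if (nrow(moreinfo) == 0) {
    next                        # nothing to analyse for this m_id, move on
  }
  first_col <- moreinfo[, 1]    # safe here: at least one row and one column
  processed <- c(processed, m_id)
}
```

After the loop, processed contains only the ids whose query returned rows, and the loop no longer stops at the empty one.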
[R] I need help making a data.frame comprised of selected columns of an original data frame.
I must have missed something simple, but still, I don't know what.

I obtained my basic data as follows:

    x <- sprintf("SELECT m_id,sale_date,YEAR(sale_date) AS sale_year,WEEK(sale_date) AS sale_week,return_type,0.0001 + DATEDIFF(return_date,sale_date) AS elapsed_time FROM `merchants2`.`risk_input` WHERE DATEDIFF(return_date,sale_date) IS NOT NULL")
    moreinfo <- dbGetQuery(con, x)

I then made the data frame I want to use as follows:

    fun_m_id <- function(df)
      if (length(df$elapsed_time) > 5) {
        rv = fitdist(df$elapsed_time, "exp")
        rv$mid = df$m_id[1]
        rv
      }
    aaa <- lapply(split(moreinfo, list(moreinfo$m_id), drop = TRUE), fun_m_id)
    m_id_default_res <- do.call(rbind, aaa)

At this point, each row in m_id_default_res corresponds to one data.frame produced by fitdist. When I print it, I get the output I expected. However, I need to store only some of it into my DB.

Because fitdist produces a data frame that includes a lot of info I don't need to store in the DB, I tried making a new data.frame containing only the info I need, as follows:

    ndf = data.frame()
    for (i in 1:length(m_id_default_res[,1])) {
      ndf$mid[i] = m_id_default_res$mid[i]
      ndf$estimate[i] = m_id_default_res$estimate[i]
      ndf$sd[i] = m_id_default_res$sd[i]
      ndf$n[i] = m_id_default_res[i]
      ndf$loglik[i] = m_id_default_res$loglik[i]
      ndf$aic[i] = m_id_default_res$aic[i]
      ndf$bic[i] = m_id_default_res$bic[i]
      ndf$chisq[i] = m_id_default_res$chisq[i]
      ndf$chisqpvalue[i] = m_id_default_res$chisqpvalue[i]
      ndf$chisqdf[i] = m_id_default_res$chisqdf[i]
    }
    ndf

And I get the following error:

    Error in `$<-.data.frame`(`*tmp*`, "n", value = list(0.114752782316094)) :
      replacement has 1 rows, data has 0

I need to either get rid of the columns in m_id_default_res that I don't need, or copy only the columns I need to a new data.frame. How do I do this? Obviously, doing an element-wise copy, at least as I tried to do it, doesn't work.

Thanks,

Ted
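A small sketch of the two options mentioned above, assuming for illustration that m_id_default_res is an ordinary data frame (the stand-in below is made up, and 'extra' is a hypothetical column that should not be stored). Selecting columns by name needs no loop. If a loop is wanted anyway, the target data frame has to exist with the right number of rows first, which is why growing an empty data.frame() element by element fails.

```r
# Made-up stand-in for m_id_default_res with one unwanted column.
m_id_default_res <- data.frame(
  mid      = c(206, 229, 251),
  estimate = c(0.1147528, 0.0736, 0.074421),
  sd       = c(0.04336918, 0.01999671, 0.002171616),
  extra    = c("a", "b", "c")
)

# Option 1: keep only the wanted columns, selected by name.
wanted <- c("mid", "estimate", "sd")
ndf <- m_id_default_res[, wanted]

# Option 2: element-wise copy into a preallocated data frame.
n_rows <- nrow(m_id_default_res)
ndf2 <- data.frame(mid = numeric(n_rows),
                   estimate = numeric(n_rows),
                   sd = numeric(n_rows))
for (i in seq_len(n_rows)) {
  for (col in wanted) {
    ndf2[[col]][i] <- m_id_default_res[[col]][i]
  }
}
```

Both options give the same three-column result here; the first is shorter and avoids the row-count mismatch entirely.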
Re: [R] I need help making a data.frame comprised of selected columns of an original data frame.
Hi Steve, Thanks Here is a tiny subset of the data: dput(head(moreinfo, 40)) structure(list(m_id = c(171, 206, 206, 206, 206, 206, 206, 218, 224, 224, 227, 229, 229, 229, 229, 229, 229, 229, 229, 233, 233, 238, 238, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251), sale_date = c(2008-04-25 07:41:09, 2008-05-09 20:58:12, 2008-09-06 19:51:52, 2008-05-01 21:26:40, 2008-08-06 23:53:17, 2008-05-29 18:44:50, 2008-05-16 16:10:52, 2008-12-30 17:59:54, 2008-11-06 18:15:40, 2008-09-05 17:43:51, 2008-10-31 21:55:52, 2008-04-30 21:30:36, 2008-11-11 00:43:54, 2008-07-24 22:26:29, 2008-10-07 17:57:22, 2008-04-23 20:39:41, 2008-09-08 22:42:12, 2008-11-13 00:09:59, 2008-04-15 22:57:31, 2008-07-05 08:52:58, 2008-10-04 13:17:02, 2008-03-20 23:02:12, 2008-08-08 16:48:42, 2008-06-04 04:31:20, 2008-09-27 07:02:14, 2008-09-08 07:16:39, 2008-09-25 07:09:11, 2008-09-23 07:02:39, 2008-08-09 07:31:46, 2008-09-28 07:02:13, 2008-07-05 07:26:46, 2008-05-11 04:01:55, 2008-06-26 07:46:17, 2008-07-09 07:36:16, 2008-07-21 18:36:44, 2008-10-11 07:01:36, 2008-07-21 19:03:42, 2008-05-07 04:21:23, 2008-10-14 07:07:02, 2008-05-12 04:26:21 ), sale_year = c(2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L), sale_week = c(16L, 18L, 35L, 17L, 31L, 21L, 19L, 52L, 44L, 35L, 43L, 17L, 45L, 29L, 40L, 16L, 36L, 45L, 15L, 26L, 39L, 11L, 31L, 22L, 38L, 36L, 38L, 38L, 31L, 39L, 26L, 19L, 25L, 27L, 29L, 40L, 29L, 18L, 41L, 19L ), return_type = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), elapsed_time = c(1e-04, 1e-04, 3.0001, 4.0001, 21.0001, 5.0001, 24.0001, 1.0001, 8.0001, 1e-04, 1e-04, 8.0001, 14.0001, 55.0001, 35.0001, 1e-04, 1e-04, 4.0001, 1e-04, 2.0001, 5.0001, 1e-04, 
52.0001, 4.0001, 28.0001, 49.0001, 34.0001, 72.0001, 5.0001, 53.0001, 128.0001, 8.0001, 2.0001, 55.0001, 1.0001, 12.0001, 46.0001, 30.0001, 12.0001, 12.0001)), .Names = c(m_id, sale_date, sale_year, sale_week, return_type, elapsed_time), row.names = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40), class = data.frame)

The full dataset has almost 200,000 observations! That is why I hadn't posted the raw data. And m_id_default_res is even bigger because it includes all the original data along with the computed stats.

Yes, the following line you pointed out has a typo:

    ndf$n[i] = m_id_default_res[i]

It should have been

    ndf$n[i] = m_id_default_res$n[i]

Correcting that makes the error go away, but at the end of the loop, ndf is said to have 0 columns and 0 rows. That I don't understand.

But your statement below (as corrected for the right source name) does what I'd intended:

    ndf <- m_id_default_res[, c('mid', 'estimate', 'sd', 'loglik', 'aic', 'bic', 'chisq', 'chisqpvalue', 'chisqdf')]

Thanks

Ted

On Fri, Jul 16, 2010 at 12:04 PM, Steve Lianoglou mailinglist.honey...@gmail.com wrote:

> Hi,
>
> First: it's kind of hard to play along w/o some reproducible data. To that end, you can paste into an email the output of:
>
>     dput(moreinfo)
>
> If there are lots of rows in `moreinfo`, just give us the first ~10-20
>
>     dput(head(moreinfo, 20))
>
> Anyway:
>
> snip
>
>> At this point, each row in m_id_default_res corresponds to one data.frame produced by fitdist. When I print it, I get the output I expected. However, I need to store only some of it into my DB.
>>
>> And then, because fitdist produces a data frame that includes a lot of info I don't need to store in the DB, I tried making a new data.frame containing only the info I need as follows:
>>
>>     ndf = data.frame()
>>     for (i in 1:length(m_id_default_res[,1])) {
>>       ndf$mid[i] = m_id_default_res$mid[i]
>>       ndf$estimate[i] = m_id_default_res$estimate[i]
>>       ndf$sd[i] = m_id_default_res$sd[i]
>>       ndf$n[i] = m_id_default_res[i]
>>       ndf$loglik[i] = m_id_default_res$loglik[i]
>>       ndf$aic[i] = m_id_default_res$aic[i]
>>       ndf$bic[i] = m_id_default_res$bic[i]
>>       ndf$chisq[i] = m_id_default_res$chisq[i]
>>       ndf$chisqpvalue[i] = m_id_default_res$chisqpvalue[i]
>>       ndf$chisqdf[i] = m_id_default_res$chisqdf[i]
>>     }
>
> Forget the for loop. How about:
>
>     ndf <- m_id_default[, c('mid', 'estimate', 'sd', 'loglik', 'aic', 'bic', 'chisq', 'chisqpvalue', 'chisqdf')]
>
> Having just written that, I see something strange in your for loop. Specifically this line:
>
>     ndf$n[i] = m_id_default_res[i]
>
> m_id_default_res is a data.frame, right? Why don't you try to see what `m_id_default_res[1]` returns.
>
> I'm not sure that that's what your error message is coming from, but I foresee this to be a problem anyway, if I follow your build up code correctly.
>
> Hope that helps,
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems
[R] Elementary question about computing confidence intervals.
I would have thought this to be relatively elementary, but I can't find it mentioned in any of my stats texts.

Please consider the following:

    library(fitdistrplus)
    fp = fitdist(y, "exp");
    rate = fp$estimate;
    sd = fp$sd
    fOneWeek = exp(-rate*7);  # fraction that happens within a week - y is measured in days
    fr = exp(-rate*dt);       # fraction remaining - dt = elapsed time from time of sample to present
    fh = 1 - fr;              # fraction that occurred from time of sample to present
    # assume n = total number that have happened from time of sample to present
    T = n / fh                # T is the total number at y = 0
    NR = fr * T
    NNW = NR * (1 - fOneWeek)

(If you wanted to run this, just populate y with random numbers from an exponential distribution.)

What I show here simply extracts an estimate and standard deviation from the data.frame returned by fitdist, and tries to compute a number of integrals. What I need is the number of events that can be expected next week, next month, and from now to the end of time. Unless I have gone senile in my old age, I have the integrals correct. Please correct me if I missed something.

What I need help with (to refresh my memory, as I used to know this way back in the stone age) is computing the confidence intervals for each of these integrals.

So that I don't bother anyone with similar elementary questions: what web resource exists that defines confidence intervals for such integrals for arbitrary distributions? Or does such a resource exist?

Thanks

Ted
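One standard way to attach an approximate interval to a smooth function of a fitted parameter is the delta method. The sketch below applies it to g(rate) = exp(-7 * rate), which for an exponential distribution is the fraction still outstanding after seven days. The data are simulated, and the rate and its standard error are computed in closed form (1/mean and rate/sqrt(n)) as assumed stand-ins for fp$estimate and fp$sd. This is an asymptotic approximation and not a definitive treatment of every quantity in the post.

```r
set.seed(123)
y <- rexp(200, rate = 0.07)        # made-up elapsed times in days

rate <- 1 / mean(y)                # MLE of the exponential rate
se   <- rate / sqrt(length(y))     # asymptotic standard error of that MLE

# Quantity of interest: fraction still outstanding after 7 days.
g <- exp(-7 * rate)

# Delta method: se(g) is approximately |g'(rate)| * se(rate).
se_g     <- abs(-7 * exp(-7 * rate)) * se
ci_delta <- g + c(-1, 1) * qnorm(0.975) * se_g

# Alternative: g is monotone (decreasing) in rate, so the endpoints of a
# confidence interval for rate can be transformed directly and reordered.
ci_rate        <- rate + c(-1, 1) * qnorm(0.975) * se
ci_transformed <- rev(exp(-7 * ci_rate))
```

The same pattern would apply to the other quantities (for example fr with dt in place of 7), as long as each is written as a differentiable function of the rate.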
[R] Troubles with DBI's dbWriteTable in RMySQL
I am feeling rather dumb right now.

I created what I thought was a data.frame as follows:

    aaa <- lapply(split(moreinfo, list(moreinfo$m_id), drop = TRUE), fun_m_id)
    m_id_default_res <- do.call(rbind, aaa)
    print("==")
    m_id_default_res
    print("==")
    ndf <- m_id_default_res[, c('mid', 'estimate', 'sd', 'loglik', 'aic', 'bic', 'chisq', 'chisqpvalue', 'chisqdf')]
    ndf

The data in ndf is perfect, exactly what I expected when I print the contents as shown in the last statement above.

On the assumption that that is a data frame, I tried

    dbWriteTable(con, "test1", ndf);

But I received the following error:

    Error in function (classes, fdef, mtable) :
      unable to find an inherited method for function "dbWriteTable", for signature "MySQLConnection", "character", "matrix"

Then, on the assumption it is trivial to convert a matrix into a data.frame, I tried:

    dbWriteTable(con, "test2", as.data.frame(ndf));

But this produced the following error:

    Error in write.table(x, file, nrow(x), p, rnames, sep, eol, na, dec, as.integer(quote), :
      unimplemented type 'list' in 'EncodeElement'

The silly, and frustrating, thing is that I used dbWriteTable before, and that worked adequately. But that was with a simple data frame (filled within a for loop, element by element: res$var[[i]] = expression), not the result of do.call(rbind(...)).

The principal limitation I saw in my previous use of dbWriteTable is that all fields are given the type 'TEXT', and that it insists on creating a new table. What I'd prefer is a kind of bulk insert that just adds extra records to an existing table.

So, given my past experience with dbWriteTable, it is a question of what do.call(rbind(..)) did to produce an ndf that dbWriteTable doesn't like.
So, then, what is the best way to either get dbWriteTable working (ideally in a way that works around the limitations I mention above) or to do a bulk insert into my MySQL table? (Yes, I already have a table in the relevant schema with all the right data types for each field, and I load RMySQL at the start of my program.) In a worst case, I can live with an insertion one record at a time.

Thanks

Ted

PS: If it helps, here are the contents of ndf, as shown by entering 'ndf' at the R prompt:

    ndf
    mid estimate sd loglik aic bic chisq chisqpvalue chisqdf
    206 206 0.1147528 0.04336918 -22.15483 46.30965 46.25556 4.433502 0.035240131
    229 229 0.0736 0.01999671 -56.41179 114.8236 115.5962 195307.1 0 2
    251 251 0.074421 0.002171616 -4224.072 8450.144 8455.212 593302.2 0 18
    252 252 0.03710208 0.0004556731 -28426.82 56855.65 56862.45 3543373 0 38
    253 253 0.01397349 0.0005900857 -2925.179 5852.358 5856.677 283.9848 5.232282e-51 16
    254 254 0.09043846 0.01528502 -119.108 240.216 241.7713 23.52441 3.139385e-05 3
    255 255 0.05078883 0.0006021373 -28294.38 56590.76 56597.63 1988844 0 35
    260 260 0.03392846 0.005499136 -166.5730 335.1461 336.7837 10.83060 0.054844135
    268 268 0.05357114 0.01785082 -35.3407 72.6814 72.87863 82995.79 0 2
    286 286 0.09321947 0.01987217 -74.20157 150.4031 151.4942 1.698603 0.6372445 3
    290 290 0.03841793 0.006584153 -144.8139 291.6277 293.1541 135.8937 2.902434e-29 3
    292 292 0.06289269 0.01988338 -37.66325 77.32651 77.6291 143099.8 0 2
    297 297 0.01674874 0.004047625 -86.52035 175.0407 175.8739 47.27713 3.034432e-10 3
    302 302 0.02878066 0.003876092 -250.1428 502.2857 504.293 9.22447 0.2369393 7
    306 306 0.07904849 0.0004164051 -127449.0 254899.9 254908.4 111574416 0 40
    307 307 0.01655872 0.001320903 -795.7314 1593.463 1596.513 57.38622 1.127804e-08 10
    308 308 0.02631102 0.000884155 -4095.149 8192.298 8197.081 142.8876 3.904898e-20 21
    309 309 0.09891599 0.0084501 -453.9474 909.8947 912.8147 357135.5 0 8
    310 310 0.09332047 0.004580396 -1399.262 2800.524 2804.552 217126 0 13
    311 311 0.06378327 0.0005049166 -59848.62 119699.2 119706.9 59481893 0 34
    313 313 0.06203001 0.0006486936 -34546.67 69095.34 69102.46 18207698 0 32
    316 316 0.173 0.07026985 -25.04100 52.08199 52.38458 18002.22 0 2
    317 317 0.04405086 0.0005949207 -22578.44 45158.88 45165.49 8923236 0 33
    320 320 0.05747093 0.006634162 -289.2357 580.4714 582.7889 8.641322 0.2794433 7
    321 321 0.06365155 0.003692525 -1115.037 2232.073 2235.767 19.10553 0.0860133712
    322 322 0.05737672 0.01532991 -54.01363 110.0273 110.6663 9.597753 0.008238998 2
    323 323 0.03116934 0.001909146 -1188.573 2379.146 2382.73 109.7663 6.656046e-18 12
    324 324 0.03027327 0.0004146385 -23922.15 47846.3 47852.88 47330365 0 32
    325 325 0.06047783 0.00922026 -163.6356 329.2711 331.0323 1695781 0 3
    326 326 0.05627898 0.0008642285 -16432.57 32867.13 32873.48 3405089 0 29
    327 327 0.07052627
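The two error messages above are consistent with ndf being a matrix whose cells are list elements, which is what rbind produces when it is handed lists rather than data frames. The sketch below reproduces that situation with made-up plain lists standing in for the fitdist results, then rebuilds a data frame with atomic columns. The stand-in values and the column set are assumptions for illustration.

```r
# Made-up stand-ins for two fitdist-like results, stored as plain lists.
fits <- list(
  `206` = list(mid = 206, estimate = 0.1147528, sd = 0.04336918),
  `229` = list(mid = 229, estimate = 0.0736,    sd = 0.01999671)
)

# rbind on lists gives a matrix of mode "list", not a data frame.
m <- do.call(rbind, fits)

# Rebuild a data frame with one atomic (numeric) vector per column.
cols <- c("mid", "estimate", "sd")
ndf <- as.data.frame(
  lapply(setNames(cols, cols), function(k) unlist(m[, k], use.names = FALSE)),
  row.names = NULL
)
```

A data frame like this is the kind of object dbWriteTable is documented to take. For adding rows to an existing table, the append and row.names arguments of dbWriteTable look like the relevant ones, though I have not tested that against this schema.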
[R] How do I combine lists of data.frames into a single data frame?
The data.frame is constructed by one of the following functions:

    funweek <- function(df)
      if (length(df$elapsed_time) > 5) {
        rv = fitdist(df$elapsed_time, "exp")
        rv$year = df$sale_year[1]
        rv$sample = df$sale_week[1]
        rv$granularity = "week"
        rv
      }
    funmonth <- function(df)
      if (length(df$elapsed_time) > 5) {
        rv = fitdist(df$elapsed_time, "exp")
        rv$year = df$sale_year[1]
        rv$sample = df$sale_month[1]
        rv$granularity = "month"
        rv
      }

It is basically the data.frame created by fitdist, extended to include the variables used to distinguish one sample from another.

I have the following statement that gets me a set of IDs from my db:

    ids <- dbGetQuery(con, "SELECT DISTINCT m_id FROM risk_input")

And then I have a loop that allows me to analyze one dataset after another:

    for (i in 1:length(ids[,1])) {
      print(i)
      print(ids[i,1])

Then, after a set of statements that give me information about the dataset (such as its size), within a conditional block that ensures I apply the analysis only on sufficiently large samples, I have the following:

    z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_week), drop = TRUE), funweek)

or

    z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_month), drop = TRUE), funmonth)

followed by:

    str(z)

Of course, I close the loop and disconnect from my db.

NB: I don't see any way to get rid of the loop by adding ID as a factor to split, because I have to query the DB for several key bits of data in order to determine whether or not there is sufficient data to work on.

I have everything working, except the final step of storing the results back into the db. Storing data in the DB is easy enough. But I am at a loss as to how to combine the lists placed in z in most of the iterations through the ID loop into a single data.frame.

Now, I did take a look at rbind and cbind, but it isn't clear to me if either is appropriate. All the data frames have the same structure, but the lists are of variable length, and I am not certain how either might be used inside the IDs loop.

So, what is the best way to combine all lists assigned to z into a single data.frame?

Thanks

Ted
Re: [R] How do I combine lists of data.frames into a single data frame?
Thanks Marc

The next part of the question, though, involves the fact that there is a new 'z' list made in almost every iteration through the ID loop.

I guess there are two parts to the question. First, how would I make a list containing all the data frames created by a call to rbind? I assume, then, that I could call rbind again to make that new list into a single data.frame. Second, is it possible to just append one list of objects to another list of objects, and would doing that and calling rbind on that master list be more efficient than calling rbind on each z list and then calling rbind after the loop on the list of such data.frames?

Thanks again,

Ted

On Thu, Jul 15, 2010 at 3:27 PM, Marc Schwartz marc_schwa...@me.com wrote:

> On Jul 15, 2010, at 2:18 PM, Ted Byers wrote:
>
>> The data.frame is constructed by one of the following functions:
>>
>>     funweek <- function(df)
>>       if (length(df$elapsed_time) > 5) {
>>         rv = fitdist(df$elapsed_time, "exp")
>>         rv$year = df$sale_year[1]
>>         rv$sample = df$sale_week[1]
>>         rv$granularity = "week"
>>         rv
>>       }
>>     funmonth <- function(df)
>>       if (length(df$elapsed_time) > 5) {
>>         rv = fitdist(df$elapsed_time, "exp")
>>         rv$year = df$sale_year[1]
>>         rv$sample = df$sale_month[1]
>>         rv$granularity = "month"
>>         rv
>>       }
>>
>> It is basically the data.frame created by fitdist, extended to include the variables used to distinguish one sample from another.
>>
>> I have the following statement that gets me a set of IDs from my db:
>>
>>     ids <- dbGetQuery(con, "SELECT DISTINCT m_id FROM risk_input")
>>
>> And then I have a loop that allows me to analyze one dataset after another:
>>
>>     for (i in 1:length(ids[,1])) {
>>       print(i)
>>       print(ids[i,1])
>>
>> Then, after a set of statements that give me information about the dataset (such as its size), within a conditional block that ensures I apply the analysis only on sufficiently large samples, I have the following:
>>
>>     z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_week), drop = TRUE), funweek)
>>
>> or
>>
>>     z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_month), drop = TRUE), funmonth)
>>
>> followed by:
>>
>>     str(z)
>>
>> Of course, I close the loop and disconnect from my db.
>>
>> NB: I don't see any way to get rid of the loop by adding ID as a factor to split, because I have to query the DB for several key bits of data in order to determine whether or not there is sufficient data to work on.
>>
>> I have everything working, except the final step of storing the results back into the db. Storing data in the DB is easy enough. But I am at a loss as to how to combine the lists placed in z in most of the iterations through the ID loop into a single data.frame.
>>
>> Now, I did take a look at rbind and cbind, but it isn't clear to me if either is appropriate. All the data frames have the same structure, but the lists are of variable length, and I am not certain how either might be used inside the IDs loop.
>>
>> So, what is the best way to combine all lists assigned to z into a single data.frame?
>>
>> Thanks
>>
>> Ted
>
> Ted,
>
> If each of the data frames in the list 'z' has the same column structure, you can use:
>
>     do.call(rbind, z)
>
> The result of which will be a single data frame containing all of the rows from each of the data frames in the list.
>
> HTH,
>
> Marc Schwartz

--
R.E.(Ted) Byers, Ph.D.,Ed.D.
t...@merchantservicecorp.com
CTO
Merchant Services Corp.
350 Harry Walker Parkway North, Suite 8
Newmarket, Ontario
L3Y 8L3
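On the second part of the question above: lists can be appended with c(), so one option is to collect every iteration's 'z' into a master list and call rbind once after the loop. The sketch assumes each element is a data frame with identical columns. make_z is a hypothetical stand-in for the lapply(split(...)) step, and the ids and values are made up.

```r
# Hypothetical stand-in for one iteration's 'z': a list of one-row data frames.
make_z <- function(id) {
  list(
    data.frame(m_id = id, sample = 1, estimate = 0.05),
    data.frame(m_id = id, sample = 2, estimate = 0.06)
  )
}

all_z <- list()
for (id in c(206, 229, 251)) {
  z <- make_z(id)
  all_z <- c(all_z, z)        # append this iteration's list to the master list
}

# One rbind over the master list, after the loop has finished.
result <- do.call(rbind, all_z)
```

Binding once at the end avoids repeatedly copying a growing data frame inside the loop. Whether that matters in practice depends on how many iterations there are.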
Re: [R] How do I combine lists of data.frames into a single data frame?
Byers wrote:

Thanks Marc

The next part of the question, though, involves the fact that there is a new 'z' list made in almost every iteration through the ID loop.

I guess there are two parts to the question. First, how would I make a list containing all the data frames created by a call to rbind? I assume, then, that I could call rbind again to make that new list into a single data.frame. Second, is it possible to just append one list of objects to another list of objects, and would doing that and calling rbind on that master list be more efficient than calling rbind on each z list and then calling rbind after the loop on the list of such data.frames?

Thanks again,

Ted

On Thu, Jul 15, 2010 at 3:27 PM, Marc Schwartz marc_schwa...@me.com wrote:

On Jul 15, 2010, at 2:18 PM, Ted Byers wrote:

The data.frame is constructed by one of the following functions:

    funweek <- function(df)
      if (length(df$elapsed_time) > 5) {
        rv = fitdist(df$elapsed_time, "exp")
        rv$year = df$sale_year[1]
        rv$sample = df$sale_week[1]
        rv$granularity = "week"
        rv
      }

    funmonth <- function(df)
      if (length(df$elapsed_time) > 5) {
        rv = fitdist(df$elapsed_time, "exp")
        rv$year = df$sale_year[1]
        rv$sample = df$sale_month[1]
        rv$granularity = "month"
        rv
      }

It is basically the data.frame created by fitdist extended to include the variables used to distinguish one sample from another.
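The follow-up question (appending each iteration's list to a master list and binding once at the end) can be sketched as below. This is a minimal sketch under stated assumptions, not the code from the thread: `fit_one`, the merchant IDs, and the simulated `moreinfo` are hypothetical stand-ins, and the exponential rate is computed directly as 1 / mean rather than with `fitdist`. As far as I know, `fitdist` returns a list-like object of class "fitdist" rather than a data frame, so reducing each fit to a one-row data frame first is what makes `do.call(rbind, ...)` applicable.

```r
# Stand-in for the per-subsample fit: returns a one-row data frame,
# or NULL when the subsample is too small to be worth fitting.
fit_one <- function(df) {
  if (nrow(df) <= 5) return(NULL)
  data.frame(year = df$sale_year[1],
             sample = df$sale_week[1],
             granularity = "week",
             rate = 1 / mean(df$elapsed_time),  # exponential MLE
             n = nrow(df))
}

set.seed(1)
ids <- c(171, 206, 218)      # hypothetical merchant IDs
all_results <- list()        # master list, grows across the ID loop

for (id in ids) {
  # Hypothetical data for one ID; in the thread this comes from the DB.
  moreinfo <- data.frame(sale_year = 2010,
                         sale_week = sample(1:4, 80, replace = TRUE),
                         elapsed_time = rexp(80, rate = 0.1))
  z <- lapply(split(moreinfo,
                    list(moreinfo$sale_year, moreinfo$sale_week),
                    drop = TRUE),
              fit_one)
  z <- Filter(Negate(is.null), z)                  # drop skipped subsamples
  if (length(z) > 0) {
    z <- lapply(z, function(d) cbind(m_id = id, d))  # tag rows with the ID
    all_results <- c(all_results, z)               # append list to list
  }
}

# One call to rbind after the loop: one row per fitted subsample.
combined <- do.call(rbind, unname(all_results))
```

Appending with `c()` and binding once after the loop avoids repeatedly copying a growing data frame; whether that matters in practice depends on how many subsamples there are.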
[R] exercise in frustration: applying a function to subsamples
From the documentation I have found, it seems that one of the functions from package plyr, or a combination of functions like split and lapply, would allow me to have a really short R script to analyze all my data (I have reduced it to a couple hundred thousand records with about half a dozen fields).

I get the same result from ddply and split/lapply:

    > ddply(moreinfo, c("m_id","sale_year","sale_week"),
    +   function(df) data.frame(res = fitdist(df$elapsed_time,"exp"), est = res$estimate, sd = res$sd))
    Error in fitdist(df$elapsed_time, "exp") :
      data must be a numeric vector of length greater than 1

and

    > lapply(split(moreinfo, list(moreinfo$m_id,moreinfo$sale_year,moreinfo$sale_week)),
    +   function(df) fitdist(df$elapsed_time,"exp"))
    Error in fitdist(df$elapsed_time, "exp") :
      data must be a numeric vector of length greater than 1

Now, in retrospect, unless I misunderstood the properties of a data.frame, I suppose a data.frame might not have been entirely appropriate as the m_id samples start and end on very different dates, but I would have thought a list data structure should have been able to handle that. It would seem that split is making groups that have the same start and end dates (or that if, for example, I have sale data for precisely the last year, split would insist on both 2009 and 2010 having weeks from 0 through 52 instead of just the weeks in each year that actually have data: 26 through 52 for last year and 1 through 25 for this year). I don't see how else the data passed to fitdist could have a sample size of 0.

I'd appreciate understanding how to resolve this. However, it isn't a show stopper, as it now seems trivial to just break it out into a loop (followed by a lapply/split combo using only sale year and sale month).

While I am asking, is there a better way to split such temporally ordered data into weekly samples that respect the year in which the sample is taken as well as the week in which it is taken?

Thanks

Ted
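For the closing question about weekly samples that respect the year, one possible approach is to derive the year and week keys in R from the timestamp and split on both with `drop = TRUE`. The data frame `sales` below is hypothetical, and `%U` (week of the year with Sunday as the first day) will not necessarily match the numbering of MySQL's `WEEK()` in every mode, so treat this as a sketch.

```r
# Hypothetical sale timestamps spanning two calendar years.
sales <- data.frame(
  sale_date = as.POSIXct(c("2009-07-01 10:00:00", "2009-07-02 11:30:00",
                           "2009-12-30 09:15:00", "2010-01-05 14:45:00",
                           "2010-01-06 08:05:00"), tz = "UTC"),
  elapsed_time = c(3.0001, 21.0001, 0.0001, 4.0001, 7.0001)
)

# Year and week-of-year as separate keys, so week 1 of 2010 is never
# pooled with week 1 of 2009.
sales$sale_year <- as.integer(format(sales$sale_date, "%Y"))
sales$sale_week <- as.integer(format(sales$sale_date, "%U"))

# Without drop = TRUE, split() creates a group for every year x week
# combination of the observed levels, including combinations with no rows.
all_combos <- split(sales, list(sales$sale_year, sales$sale_week))

# With drop = TRUE only the combinations that actually occur are kept.
groups <- split(sales, list(sales$sale_year, sales$sale_week), drop = TRUE)
sizes <- sapply(groups, nrow)
```

The empty combinations in `all_combos` are the zero-length samples that trigger the `fitdist` error quoted above.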
Re: [R] exercise in frustration: applying a function to subsamples
OK, here is a stripped down variant of my code. I can run it here unchanged (apart from the credentials for connecting to my DB).

    Sys.setenv(MYSQL_HOME='C:/Program Files/MySQL/MySQL Server 5.0')
    library(TSMySQL)
    library(plyr)
    library(fitdistrplus)
    con <- dbConnect(MySQL(), user="rejbyers", password="jesakos", dbname="merchants2")
    x <- sprintf("SELECT m_id,sale_date,YEAR(sale_date) AS sale_year,WEEK(sale_date) AS sale_week,return_type,0.0001 + DATEDIFF(return_date,sale_date) AS elapsed_time FROM `risk_input` WHERE DATEDIFF(return_date,sale_date) IS NOT NULL")
    x
    moreinfo <- dbGetQuery(con, x)
    str(moreinfo)
    #moreinfo
    #print(moreinfo)
    dbDisconnect(con)
    f1 <- fitdist(moreinfo$elapsed_time,"exp");
    summary(f1)
    lapply(split(moreinfo, list(moreinfo$m_id,moreinfo$sale_year,moreinfo$sale_week), drop = TRUE),
           function(df) fitdist(df$elapsed_time,"exp"))

I guess that for others to run this script, it is just necessary to create some sample data, consisting of two or more m_id values (I have several hundred), and temporally ordered data for each. I am not familiar enough with R to know how to do that using R. Usually, if I need dummy data, I make it with my favourite rng using either C++ or Perl. I am still trying to get used to R.

Each record in my data has one random variate and a MySQL TIMESTAMP (nn-nn- nn:nn:nn), anywhere from hundreds to thousands each week for anywhere from a few months to several years. My SQL actually produces the random variate by taking the difference between the sale date and return date, and is structured as it is because I know how to group by year and week from a timestamp field using SQL but didn't know how to accomplish the same thing in R.

The statement 'x' by itself always shows me the correct SQL statement to get the data (I can execute it unchanged in the mysql commandline client).

'str(moreinfo)' always gives me the data structure I expect. E.g.:

    > str(moreinfo)
    'data.frame': 177837 obs. of 6 variables:
     $ m_id        : num 171 206 206 206 206 206 206 218 224 224 ...
     $ sale_date   : chr "2008-04-25 07:41:09" "2008-05-09 20:58:12" "2008-09-06 19:51:52" "2008-05-01 21:26:40" ...
     $ sale_year   : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
     $ sale_week   : int 16 18 35 17 31 21 19 52 44 35 ...
     $ return_type : num 1 1 1 1 1 1 1 1 1 1 ...
     $ elapsed_time: num 0.0001 0.0001 3.0001 4.0001 21.0001 ...

'summary(f1)' shows me the results I expect from the aggregate data. E.g.:

    > summary(f1)
    FITTING OF THE DISTRIBUTION ' exp ' BY MAXIMUM LIKELIHOOD
    PARAMETERS
          estimate   Std. Error
    rate 0.0652917 0.0001547907
    Loglikelihood: -663134.7 AIC: 1326271 BIC: 1326281
    --
    GOODNESS-OF-FIT STATISTICS
    _ Chi-squared_
    Chi-squared statistic: 400277239
    Degree of freedom of the Chi-squared distribution: 56
    Chi-squared p-value: 0
    !!! the p-value may be wrong with some theoretical counts < 5 !!!
    !!! For continuous distributions, Kolmogorov-Smirnov and Anderson-Darling statistics should be prefered !!!
    _ Kolmogorov-Smirnov_
    Kolmogorov-Smirnov statistic: 0.1660987
    Kolmogorov-Smirnov test: rejected
    !!! The result of this test may be too conservative as it assumes that the distribution parameters are known !!!
    _ Anderson-Darling_
    Anderson-Darling statistic: Inf
    Anderson-Darling test: rejected

And at the end, I get the error I mentioned. NB: In this variant, I added drop = TRUE as Jim suggested.

    > lapply(split(all_samples, list(all_samples$m_id,all_samples$sale_year,all_samples$sale_week), drop = TRUE),
    +   function(df) fitdist(df$elapsed_time,"exp"))
    Error in fitdist(df$elapsed_time, "exp") :
      data must be a numeric vector of length greater than 1

If, then, drop = TRUE results in all empty combinations of m_id, year and week being excluded, then (noticing the requirement is actually that the sample size be greater than 1), I can only conclude that at least one of the samples has only 1 record.

But that is too small. Is there a way to allow the above code to apply fitdist only if the sample size of a given subsample is greater than, say, 100?
Even better, is there a way to make the split more dynamic, so that it groups a given m_id's data by month if the average weekly subsample size is less than 100, or by day if the average weekly subsample is greater than 1000?

Thanks

Ted

On Mon, Jul 12, 2010 at 3:20 PM, Erik Iverson er...@ccbr.umn.edu wrote:

Your code is not reproducible. Can you come up with a small example showing the crux of your data structures/problem, that we can all run in our R sessions? You're likely to get much higher quality responses this way.
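Both questions in this message (fit only subsamples above a size threshold, and pick the grouping granularity from the average weekly sample size) can be sketched in base R as follows. The helper names `fit_exp`, `fit_large_groups` and `choose_granularity` and the simulated `demo` data are hypothetical. `fit_exp` is a stand-in for `fitdist(x, "exp")` that uses the closed-form exponential MLE, rate = 1 / mean(x), with asymptotic standard error rate / sqrt(n).

```r
# Closed-form exponential fit for one subsample.
fit_exp <- function(df) {
  n <- nrow(df)
  rate <- 1 / mean(df$elapsed_time)
  data.frame(n = n, rate = rate, se = rate / sqrt(n))
}

# Split on the named key columns, drop empty combinations, and fit only
# the groups with more than min_n records.
fit_large_groups <- function(data, keys, min_n = 100) {
  groups <- split(data, data[keys], drop = TRUE)
  groups <- Filter(function(g) nrow(g) > min_n, groups)
  lapply(groups, fit_exp)
}

# Choose a grouping level from the average size of the non-empty weeks.
choose_granularity <- function(data, low = 100, high = 1000) {
  weekly <- table(data$sale_year, data$sale_week)
  avg <- mean(weekly[weekly > 0])
  if (avg < low) "month" else if (avg > high) "day" else "week"
}

set.seed(42)
# Hypothetical data: week 1 is large, week 2 has only three records.
demo <- data.frame(
  m_id = 206,
  sale_year = 2010,
  sale_week = c(rep(1, 150), rep(2, 3)),
  elapsed_time = rexp(153, rate = 0.065) + 0.0001
)

fits <- fit_large_groups(demo, c("m_id", "sale_year", "sale_week"))
level <- choose_granularity(demo)
```

Filtering the list returned by `split()` before calling `lapply()` keeps the size rule separate from the fitting function, so the same filter can be reused with `fitdist` or any other fitting routine.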
Re: [R] exercise in frustration: applying a function to subsamples
Thanks Jim,

I acted on your suggestion and found the result unchanged. :-( Then I noticed that fitdist doesn't like a sample size of 1 either. If, then, drop = TRUE results in all empty combinations of m_id, year and week being excluded, then (noticing the requirement is actually that the sample size be greater than 1), I can only conclude that at least one of the samples has only 1 record. I hadn't realized that some of the subsamples were that small.

In my reply to Erik, I wrote:

But that is too small. Is there a way to allow the above code to apply fitdist only if the sample size of a given subsample is greater than, say, 100? Even better, is there a way to make the split more dynamic, so that it groups a given m_id's data by month if the average weekly subsample size is less than 100, or by day if the average weekly subsample is greater than 1000?

Thanks

Ted

On Mon, Jul 12, 2010 at 4:02 PM, jim holtman jholt...@gmail.com wrote:

try 'drop=TRUE' on the split function call. This will prevent the NULL set from being sent to the function.

-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve?
[R] I need guidance on better data management in preparation for time series analysis
OK, I have managed to use some of the basic processes of getting data from my DB, passing it as a whole to something like fitdistr, etc.

I know I can implement most of what I need using a brute force algorithm based on a series of nested loops. I also know I can handle some of this logic in a brute force method using a blend of perl and R, with considerable file IO. But some of what I need needs a smarter/faster way.

To understand what I am after, consider the following. I have transaction data comprised of sales and refunds, each of which has a timestamp. The refund data has a timestamp representing when the refund was issued and an original transaction ID representing the sale it refunds. I have massaged this data in my schema so that there is a table that has a record for each refund, and this record includes, among other things, the timestamps for both the original sale and the refund. I can construct a SQL query to get these along with the elapsed time (in days, as a real number) between the sale and refund.

For some merchants, I have such data going back years. I know, from the amount of data I have examined, that the rate at which sales result in refunds changes through time, though I have not run tests to determine whether or not the changes I see are significant. In most cases, I can break the data for a merchant into weekly subsamples.

Obviously, I can construct loops that iterate over merchant ID, and year/week (or day) covering the entire period for which I have data for a given merchant. What I am asking is, is there a smarter way? I can't load all the data as there are many GB of data, but the data for individual merchants varies from a few hundred kB to a few dozen MB. Thus, I expect an outer loop iterating over merchant ID will be inevitable.
But, is there a smarter way to apply fitdistr (or a similar function) to samples representing sales in each week of each year (or each day of the year when there is sufficient data), and then test to see if the parameter of the exponential distribution that best fits the data varies significantly through time? (There are both theoretical and empirical reasons to expect an exponential distribution, but the specific distribution doesn't really matter for the purpose of this question.)

That is one question I need to deal with. Is there a simple way to specify a function, a dataset and a rule for determining all the subsamples, and then tell R to apply the function to each subsample and then say whether or not the estimated parameters for the subsamples are significantly different? Or do I have to resort to the simple brute force approach of using a set of nested loops to get what I need?

The other question I have at present is more a statistical question: Integrating an exponential pdf over a given time period is simple enough, but I need to learn how confidence intervals for that integral are to be computed when you have the estimate and standard deviation of the parameter for the exponential distribution from something like fitdistr. This gets to how to get confidence intervals when dealing with integrals of functions of uncertain numbers. Not only is there a confidence interval for the parameter of the exponential distribution, but to estimate how many refunds to expect for the next week, one not only needs the confidence intervals of the integral of the pdf over the next week for a given sample, but one needs to integrate this over all the samples that could produce a refund in the coming week.

I'd appreciate any information anyone can provide, even if that consists of a URL that points to a resource that deals with the specific questions I have.
I am afraid all the resources I have found searching so far have been at a more introductory level of simply making a connection to a DB and then submitting a SQL statement to it. Something in between that level and the level comprised of the maze of documentation for the plethora of relevant packages is needed here (there is such an embarrassment of riches, I find myself getting confused as to how to proceed).

Thanks

Ted
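On the statistical question, one standard first-order answer is the delta method. For an exponential distribution with rate λ, the integral of the pdf from a to b days is g(λ) = exp(-λa) - exp(-λb), its derivative is g'(λ) = -a exp(-λa) + b exp(-λb), and an approximate standard error is |g'(λ)| times the standard error of λ. The sketch below assumes that approximation is adequate; the function name `refund_prob_ci` and the numeric inputs are illustrative only, not values from the thread.

```r
# Delta-method confidence interval for the probability that an
# exponentially distributed elapsed time falls in [a, b].
refund_prob_ci <- function(rate, se_rate, a, b, level = 0.95) {
  p <- exp(-rate * a) - exp(-rate * b)             # integral of the pdf over [a, b]
  dp <- -a * exp(-rate * a) + b * exp(-rate * b)   # derivative with respect to rate
  se_p <- abs(dp) * se_rate                        # first-order standard error
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = p, lower = p - z * se_p, upper = p + z * se_p)
}

# Illustrative values: rate 0.065 per day with standard error 0.002,
# for sales made 7 days ago and refunds arriving in the following 7 days.
ci <- refund_prob_ci(rate = 0.065, se_rate = 0.002, a = 7, b = 14)
```

This covers a single cohort of sales. Combining cohorts of different ages into one weekly forecast would also need the number of sales in each cohort and the fraction of sales that are ever refunded, and a parametric bootstrap over the rate is a common alternative when the linear approximation is in doubt.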
Re: [R] Can RMySQL be used for a paramterized query?
Thanks,

Actually, thanks to the info Henrique sent, I have made decent progress.

In actuality, I could have just submitted a SELECT * on the second table, which would give me everything, just like Henrique's suggestion, and yours, would give me. The problem is that that table is HUGE (I don't want to load ALL that data at once, especially when I'd be analyzing it in chunks defined by ID and date), and at the same time, the analyses would be done not only one ID at a time, but on records pertaining to a given day. (E.g., imagine a dataset containing all sales and refund data; assuming the rates at which sales end in refunds vary through time, something I know from previous analyses of similar data, I would need to analyze all refunds for sales that happened on a given day.)

While I was aware I could use RMySQL to get my time series data (I will be assessing a VAR on a 3D time series once my current task is done), I looked at TSMySQL because, being relatively inexperienced with R, I need to be able to do a variety of autoregressive analyses. Someone has suggested I also look at state space modelling, but being a mathematical ecologist by training, I am struggling with that along with Kalman Filtering. But that is another post ...

Thanks again

Ted

On Thu, Jun 10, 2010 at 5:51 PM, Paul Gilbert pgilb...@bank-banque-canada.ca wrote:

Ted

I'm not sure I fully understand the question, but you may want to consider creating a temporary table with a join, which you can do with a query from your R session, and then query that table to bring the data into R. Roughly, the logic is to leave the data in the db if you are not doing any fancy calculations. You might also find order by is useful. (This is what I use in TSdbi to make sure data comes back in the right order as a time series.) It may even be possible to get everything you want back in one step using this and group by, rather than looping, but depending on the analysis you want to do in R, that may not be the most convenient way.

BTW, I think you realize you do not have to use the TSMySQL commands to access the TSMySQL database. They are usually convenient, but you can query the tables directly with RMySQL functions.

Paul

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Henrique Dallazuanna
Sent: June 10, 2010 8:47 AM
To: Ted Byers
Cc: R-help Forum
Subject: Re: [R] Can RMySQL be used for a paramterized query?

I think you can do this:

    ids <- dbGetQuery(conn, "SELECT id FROM my_table")
    other_table <- dbGetQuery(conn, sprintf("SELECT * FROM my_other_table WHERE t1_id in (%s)", paste(ids, collapse = ",")))
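The per-ID loop asked about here can be sketched by building one SQL string per ID with `sprintf`. The block below only constructs the strings, so it runs without a database; the `ids` data frame is a hypothetical stand-in for the result of the first query. One detail worth noting: `dbGetQuery` returns a data frame, so the ID column has to be extracted before it is pasted into an `IN (...)` list.

```r
# Hypothetical result of: ids <- dbGetQuery(conn, "SELECT id FROM my_table")
ids <- data.frame(id = c(171L, 206L, 218L))

# One query per ID. The %d conversion accepts only integers, which guards
# against arbitrary text being spliced into the SQL.
per_id_sql <- sprintf("SELECT * FROM my_other_table WHERE t1_id = %d", ids$id)

# A single query for all IDs, pasting the column rather than the data frame.
all_ids_sql <- sprintf("SELECT * FROM my_other_table WHERE t1_id IN (%s)",
                       paste(ids$id, collapse = ","))

for (sql in per_id_sql) {
  # With a live connection this would be:
  #   chunk <- dbGetQuery(conn, sql)
  # followed by the analysis and a write of the results back to the DB.
  message(sql)
}
```

Fetching one ID at a time keeps each result set small, which matches the concern above about the second table being too large to load at once.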
[R] Can RMySQL be used for a paramterized query?
I have not found anything about this except the following from the DBI documentation:

Bind variables: the interface is heavily biased towards queries, as opposed to general purpose database development. In particular we made no attempt to define bind variables; this is a mechanism by which the contents of R/S objects are implicitly moved to the database during SQL execution. For instance, the following embedded SQL statement

    /* SQL */
    SELECT * from emp_table where emp_id = :sampleEmployee

would take the vector sampleEmployee and iterate over each of its elements to get the result. Perhaps the DBI could at some point in the future implement this feature.

I can connect, and execute a SQL query such as SELECT id FROM my_table, and display a frame with all the IDs from my_table. But I need also to do something like SELECT * FROM my_other_table WHERE t1_id = x, where 'x' is one of the IDs returned by the first select statement. Actually, I have to do this in two contexts, one where the data are not ordered by time and one where they are (and thus where I'd have to use TSMySQL to execute something like SELECT record_datetime,value FROM my_ts_table WHERE t2_id = x).

I'd like to embed this in a loop where I iterate over the IDs returned by the first select, get the appropriate data from the second for each ID, analyze that data and store results in another table in the DB, and then proceed to the next ID in the list. I suppose an alternative would be to get all the data at once, but the resulting resultset would be huge, and I don't (yet) know how to take a subset of the data in a frame based on a given value in one of the fields and analyze that. Can you point me to an example of how this is done, or do I have to use a mix of perl (to get the data) and R (to do the analysis)?

Any insights on how to proceed would be appreciated.

Thanks.

Ted
[R] TS model
I am looking at a new project involving time series analysis. I know I can complete the tasks involving VARMA using either dse or mAr (and I think there are a couple of others that might serve). However, there is one task for which I am not sure of the best way to proceed.

A simple example illustrates what I am after. If you think of a simple ballistic problem, with a vector describing current position in 3 dimensions, the components of that vector are simple functions of initial position, initial velocity (constants, for our purposes) and time. It is trivial calculus to compute these values at arbitrary time using only initial conditions and time. Of course, for such a simple problem, we know the equations of motion that we can use for this purpose.

I want to use time series values to estimate a suitable vector valued function of time in a case where we know neither the equations of change nor the initial conditions (but where we have daily values going back many years). Actually, I don't really care much about the details of the function nearly as much as the first and second derivatives of the function with respect to time; and these derivatives have to be inferred from the model of the measurements as 'simple' functions of time. And as I do not want to assume the system is autonomous, I want to be able to repeat the analysis on a moving window wherein always the current day is designated as having s = 0 (i.e. the time variable used in the estimated model slides along that representing real time). I figure that if that window is short enough, a quadratic or cubic function of time will suffice.

Finally, if the combination of first and second derivatives indicates that the first derivative will take a value of 0 at some point in the future, I want to estimate the number of days until that happens. (Yes, I know I will need some sort of orthogonalization of the time variable in order to reduce problems of multicollinearity, but that I'd expect in any multivariate nonlinear regression.)

I don't know if this could be recast as a VARMA problem, or if so, how, and how I'd get the answers to the questions of importance to me. I would welcome being enlightened on this, if there is an answer.

The question is, is there a package that already provides support for this 'out of the box', as it were, and if so which one, or do I have to construct code supporting it de novo?

Thanks

Ted
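The moving-window idea described above can be sketched in base R for a single series and a quadratic model. For y = b0 + b1 s + b2 s^2 with s = 0 at the current day, the first derivative at s = 0 is b1, the second derivative is 2 b2, and the first derivative reaches zero at s = -b1 / (2 b2), which lies in the future only when that value is positive. The function name `window_derivs` and the synthetic series are hypothetical, and raw (non-orthogonal) polynomial terms are used so the coefficients can be read off directly; that is reasonable only for short windows.

```r
# Fit a quadratic to the last `width` points, with s = 0 at the most
# recent observation, and report the derivatives at s = 0.
window_derivs <- function(y, width) {
  n <- length(y)
  win <- data.frame(y = y[(n - width + 1):n], s = seq(1 - width, 0))
  b <- unname(coef(lm(y ~ s + I(s^2), data = win)))
  d1 <- b[2]              # first derivative at s = 0
  d2 <- 2 * b[3]          # second derivative (constant for a quadratic)
  s_zero <- -d1 / d2      # solves b1 + 2 * b2 * s = 0
  c(d1 = d1, d2 = d2,
    days_to_zero = if (is.finite(s_zero) && s_zero > 0) s_zero else NA_real_)
}

# Synthetic series: y = 100 + 3 t - 0.05 t^2, whose slope is zero at t = 30.
t <- 0:19
y <- 100 + 3 * t - 0.05 * t^2

# Latest window only: the slope at t = 19 is 1.1 and vanishes 11 days later.
res <- window_derivs(y, width = 10)

# Sliding the window: one column of results per end point.
rolled <- sapply(10:length(y), function(i) window_derivs(y[1:i], width = 10))
```

For a vector-valued series the same function could be applied to each component in turn; a cubic would need the quadratic formula for the zero of the first derivative instead of the single division used here.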
[R] What does this warning mean: DLL attempted to change FPU control word from 8001f to 9001f
I started a brand new session in R 2.10.1 (on Windows). If it matters, I am running the community edition of MySQL 5.0.67, and it is all running fine.

I am just beginning to examine the process of getting time series data from one table in MySQL, computing moving averages and computing a selection of estimates based on relations among moving averages of different variates, and storing all the results in another table in MySQL.

The very first thing I did in this session was execute the following two commands:

    Sys.setenv(MYSQL_HOME='c:/MySQL')
    library(RMySQL)

The output I got was:

    Loading required package: DBI
    Warning message:
    In inDL(x, as.logical(local), as.logical(now), ...) :
      DLL attempted to change FPU control word from 8001f to 9001f

Now, I write programs in relatively high level languages (C++, perl, Java, and now R), and NEVER even consider twiddling with FPU control words or playing with registers on the processor. I have never gotten this close to the hardware since I messed with video memory in the old days when I wrote computer based teaching materials on DOS and had to get acceptable performance out of the hardware available way back then. Consequently, I have no idea what this warning means or what I ought to do about it.

I assume the DLL it is referring to is libmySQL.dll (http://www.stat.berkeley.edu/classes/s133/libmySQL.dll), which RMySQL needs. But I have no idea either why it would do what R says it is doing or why it matters to me, or what I ought to do about it.

I'd appreciate any info you can provide.

Thanks

Ted
[R] Upgrade process for libraries: can I use installed.packages on an old installation followed by install.packages in a new one
I tend to have a lot of packages installed, in part because of a wide diversity of interests and a disposition of examining different ways to accomplish a given task.

I am looking for a better way to upgrade all my packages when I upgrade the version of R that I am running.

On looking at support for installing and updating packages, I found these two: installed.packages() and install.packages(), and it occurred to me that in principle I ought to be able to use the one in the original installation to get a list of packages I'm working with and put its output into a plain text file that I can read in the new installation and pass to the other to ensure the new installation has a fresh installation of all the packages I want to work with.

The question comes WRT the fact that the output from installed.packages() does not coincide with the expected input for install.packages(). What would you recommend I do to select from the output from the former so the file I write that output to will have the information the latter wants for input? For example, will it work properly if I just write the package names installed.packages() returns to the file and ignore all the rest? It is not clear to me how I'd have it ignore those packages that are part of the core of R (or even if I need to worry about that - I did see some packages listed in the output from installed.packages() that are identified as being part of R 2.10.1, when I looked at using this procedure to set up R 2.11.0).

NB: I am not suggesting the output from the one should coincide with the expected input for the other. Rather, I am asking advice on writing simple R scripts that I can run in the one to get a file that would be suitable input for the other, so that together they make a fresh installation of a new version automatically make a fresh installation of all the previously installed packages.

Thanks

Ted

PS: I am using Windows XP, if that matters.
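The two-script procedure described above can be sketched as follows. `installed.packages()` returns a character matrix whose `Priority` column is "base" or "recommended" for packages that ship with R and `NA` for add-on packages, so selecting the `NA` rows and writing only the `Package` column gives a plain list of names that `install.packages()` accepts. The function names and any file path passed to them are hypothetical.

```r
# Names of add-on packages only (Priority is NA for these).
user_packages <- function(lib.loc = NULL) {
  ip <- installed.packages(lib.loc = lib.loc)
  unname(ip[is.na(ip[, "Priority"]), "Package"])
}

# Run in the OLD installation: one package name per line.
save_package_list <- function(file) {
  writeLines(user_packages(), file)
}

# Run in the NEW installation: install whatever is listed but not present.
restore_packages <- function(file) {
  wanted <- readLines(file)
  missing <- setdiff(wanted, rownames(installed.packages()))
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}
```

A typical use would be `save_package_list("pkgs.txt")` in the old version and `restore_packages("pkgs.txt")` in the new one. Packages that are no longer available for the new version simply fail to install with a warning, which fits the preference for a clean library.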
[[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
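The two-step procedure described in the question can be sketched roughly as follows. This is a minimal illustration, not an answer given in the thread: the helper name user_package_names and the file location are made up for the example, and it assumes that base and recommended packages are the ones with a non-NA "Priority" field in the matrix returned by installed.packages().

```r
# Sketch: carry a package list from an old R installation to a new one.
# user_package_names() and pkg_file are illustrative names, not part of R.

# Keep only add-on packages. Base and recommended packages carry a non-NA
# "Priority" entry in the matrix returned by installed.packages().
user_package_names <- function(ip) {
  unique(unname(ip[is.na(ip[, "Priority"]), "Package"]))
}

# --- run in the OLD installation ---
# tempdir() is used here only so the sketch runs anywhere; a real run needs
# a persistent path that both installations can read.
pkg_file <- file.path(tempdir(), "my_packages.txt")
writeLines(user_package_names(installed.packages()), pkg_file)

# --- run in the NEW installation ---
wanted     <- readLines(pkg_file)
to_install <- setdiff(wanted, rownames(installed.packages()))
# install.packages(to_install)  # uncomment to install; needs a CRAN mirror
```

Only the package names are written, since install.packages() takes a character vector of names; versions and library paths from the old installation are deliberately ignored so that each package is fetched fresh.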
Re: [R] Upgrade process for libraries: can I use installed.packages on an old installation followed by install.packages in a new one
When doing a fresh install of a new version of R, using update.packages() requires copying some of the contents of the library subdirectory to the new installation. While possible and viable, it can be tedious (more an irritation regarding how Windows handles copying directories from one location to another when there are already things in the target directory with the same names, than anything else), and there is the possibility that some old packages are obsolete and won't work properly in the new version. I don't suppose update.packages() will remove obsolete packages in the library directory if it finds them, does it?

I have a preference for doing a fresh install of a given product's optional packages (so if a given package has a problem in the new version, it just doesn't install, rather than cluttering its directory tree with useless stuff); something that is trivially easy with only a handful of optional packages but very tedious when there are so many. I know from experience that repeatedly having a large, complex piece of software, whether a major application (like R or MS Word, etc.) or an OS like Windows, update or overwrite key parts of itself will eventually lead to hard-to-diagnose problems. It is often good to have more than one way to accomplish a given task, and there are usually many options to choose from when designing or implementing software.

Actually, with the benefit of 20/20 hindsight, if I had been asked to write an update.packages() function, I would have had it look in the registry on Windows, or in the directory tree, for evidence of an older version of R (perhaps in a variant that is used only during a fresh install of R), process the list of detected packages, and install or upgrade any packages that will work with the new version of R. If a given obsolete package has been superseded by something else, it would make sure that 'something else' is installed instead, just so the directory tree for the new install is not cluttered with old, potentially broken, stuff.

Thanks
Ted

On Thu, Apr 29, 2010 at 4:59 PM, Erik Iverson er...@ccbr.umn.edu wrote:
> I must be missing the obvious, but what's wrong with update.packages()?

--
R.E.(Ted) Byers, Ph.D.,Ed.D.
t...@merchantservicecorp.com
CTO Merchant Services Corp.
350 Harry Walker Parkway North, Suite 8
Newmarket, Ontario L3Y 8L3
[R] Error loading RMySQL
I have R 2.10.1 and 2.9.1 installed, and both have RMySQL packages installed. A script I'd developed using an older version (2.8.?, I think) used RMySQL too, with an older version of MySQL (5.0.?), and worked fine at that time (about a year and a half ago +/- a month or two). But now, when I run it again on new data, the script works fine until it is supposed to store its results in my DB. (In actuality, I have a perl script that does an initial preparation of the data and then invokes the R script, all of which works fine.)

The very last five lines of my script are:

    print(resultsdataframe);
    library(RMySQL);
    con <- dbConnect(MySQL(), user="rejbyers", password="jesakos", dbname="merchants2");
    dbWriteTable(con, "results", resultsdataframe);
    dbDisconnect(con);

The print statement works fine, and shows me the results I expected. But library(RMySQL) fails, which makes all the rest of the lines fail. The command and error message are:

    library(RMySQL);
    Error in fun(...) : A MySQL Registry key was found but the folder C:\Program Files\MySQL\MySQL Administrator 1.1\/. doesn't contain a bin or lib/opt folder. That's where we need to find libmySQL.dll.
    Error : .onLoad failed in 'loadNamespace' for 'RMySQL'
    Error: package/namespace load failed for 'RMySQL'

It IS true that the folder C:\Program Files\MySQL\MySQL Administrator 1.1\/. doesn't contain a bin or lib/opt folder (there is no trailing 'V' in that path name!). However, libmySQL.dll actually is there in 'C:\Program Files\MySQL\MySQL Administrator 1.1'. It is also true that MySQL 5.0.67 (community edition) is installed and all the related tools (administrator, browser, etc.) work just fine. (NB: I never touch the registry unless absolutely necessary, so I don't know what R is looking at in there or if some misbehaved install program left bogus data there for R to find.)

The question is, why is R looking in the wrong place for this DLL, and what is the best way to solve this problem? I know a quick and dirty workaround is to create that path and put a copy of the DLL there, but that does not strike me as adequate. I would expect it to generate problems the next time I upgrade MySQL. So, then, what would you recommend?

Thanks
Ted
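According to the error message, the loader expects libmySQL.dll inside a bin or lib/opt subfolder of the folder named in the registry. The check below is a base-R sketch of that layout test only, useful for seeing which installed folder would satisfy it. It is not RMySQL's own code; the function name has_client_dll and the throwaway demo folders are assumptions made for illustration.

```r
# Sketch: does a candidate MySQL folder have the layout the error message
# describes (libmySQL.dll under bin/ or lib/opt/)?
# has_client_dll() is an illustrative helper, not an RMySQL function.
has_client_dll <- function(home, dll = "libmySQL.dll") {
  candidates <- file.path(home, c("bin", file.path("lib", "opt")), dll)
  any(file.exists(candidates))
}

# Demonstration with throwaway folders standing in for real install paths.
demo_root <- file.path(tempdir(), "mysql_demo")
good_home <- file.path(demo_root, "server")
bad_home  <- file.path(demo_root, "administrator")
dir.create(file.path(good_home, "bin"), recursive = TRUE, showWarnings = FALSE)
dir.create(bad_home, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(good_home, "bin", "libmySQL.dll"))
file.create(file.path(bad_home, "libmySQL.dll"))  # DLL at top level, as in the post

sapply(c(good = good_home, bad = bad_home), has_client_dll)
```

On a real machine the same helper could be pointed at the actual MySQL server and Administrator folders to see which one matches what the loader is asking for.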
[R] The time series analysis functions/packages don't seem to like my data
I have hundreds of megabytes of price data time series, and perl scripts that extract it to tab-delimited files (I have C++ programs that must analyse this data too, so I get Perl to extract it rather than have multiple connections to the DB). I can read the data into an R object without any problems:

    thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header = FALSE, na.strings = "")
    thedata

The above statements give me precisely what I expect. The last few lines of output are:

    8190 2009-06-16 49.30
    8191 2009-06-17 48.40
    8192 2009-06-18 47.72
    8193 2009-06-19 48.83
    8194 2009-06-22 46.85
    8195 2009-06-23 47.11
    8196 2009-06-24 46.97
    8197 2009-06-25 47.43

I have loaded Rmetrics and PerformanceAnalytics, among other packages. I tried as.timeseries, but R 2.9.1 tells me there is no such function. I tried as.ts(thedata), but that only replaces the date field by the row label in 'thedata'. If I apply the PerformanceAnalytics drawdowns function to either thedata or thedata$V2, I get errors:

    table.Drawdowns(thedata, top = 10)
    Error in 1 + na.omit(x) : non-numeric argument to binary operator

    table.Drawdowns(thedata$V2, top = 10)
    Error in if (thisSign == priorSign) { : missing value where TRUE/FALSE needed

thedata$V2 by itself does give me the price data from the file.

I am a relative novice in using R for time series, so I wouldn't be surprised if I missed something that would be obvious to someone more practiced in using R, but I don't see what that could be from the documentation of the functions I am looking at using. I have no shortage of data, and I don't want to write C++ code, or perl code, to do all the kinds of calculations provided in Rmetrics and PerformanceAnalytics, but getting my data into the functions these packages provide is killing me! What did I miss?

Thanks
Ted
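Before handing a data frame to a time series function, it can help to check what types its columns actually have. The sketch below uses base R only and assumes the file holds two tab-separated columns, a date and a price, as the printed output suggests; the column names date and price and the temporary file are illustrative stand-ins for BP.csv.

```r
# Sketch: read a tab-delimited date/price file and inspect column types.
# The temp file stands in for the real BP.csv; names are illustrative.
tmp <- tempfile(fileext = ".csv")
writeLines(c("2009-06-22\t46.85",
             "2009-06-23\t47.11",
             "2009-06-24\t46.97",
             "2009-06-25\t47.43"), tmp)

thedata <- read.csv(tmp, sep = "\t", header = FALSE, na.strings = "",
                    col.names = c("date", "price"),
                    colClasses = c("character", "numeric"))

# Dates arrive as text; convert them explicitly so they sort and
# subtract as dates rather than as strings.
thedata$date <- as.Date(thedata$date, format = "%Y-%m-%d")

str(thedata)  # shows one Date column and one numeric column
```

A data frame that still contains a date column is not purely numeric, which is consistent with the "non-numeric argument to binary operator" error when the whole frame is passed in.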
Re: [R] The time series analysis functions/packages don't seem to like my data
Hi Mark,

Thanks for replying. Here is a short snippet that reproduces the problem:

    library(PerformanceAnalytics)
    thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header = FALSE, na.strings = "")
    thedata
    x = as.timeseries(thedata)
    x
    table.Drawdowns(thedata, top = 10)
    table.Drawdowns(thedata$V2, top = 10)

The object 'thedata' has exactly what I expected. The line 'thedata' prints the correct contents of the file with each row prepended by a line number. The last few lines are:

    8191 2009-06-17 48.40
    8192 2009-06-18 47.72
    8193 2009-06-19 48.83
    8194 2009-06-22 46.85
    8195 2009-06-23 47.11
    8196 2009-06-24 46.97
    8197 2009-06-25 47.43

The number of lines (8197), the dates (and their format) and the prices are correct. The last four lines produce the following output:

    x = as.timeseries(thedata)
    Error: could not find function "as.timeseries"

    x
    Error: object 'x' not found

    table.Drawdowns(thedata, top = 10)
    Error in 1 + na.omit(x) : non-numeric argument to binary operator

    table.Drawdowns(thedata$V2, top = 10)
    Error in if (thisSign == priorSign) { : missing value where TRUE/FALSE needed

Are the functions in your example in Rmetrics or PerformanceAnalytics? (Like I said, I am just beginning this exploration, and I started with table.Drawdowns because it produces information that I need first.) And given that my data is in tab-delimited files, and can be read using read.csv, how do I feed my data into your four statements?

My guess is I am missing something in coercing my data in (the data frame?) thedata into a time series array of the sort the time series analysis functions need. One of the things I find a bit confusing is that some of the documentation for this mentions S3 classes and some mentions S4 classes (I don't know if that means I have to make multiple copies of my data to get the output I need). I could coerce thedata$V2 into a numeric vector, but I'd rather not separate the prices from their dates unless that is necessary (how would one produce monthly, annual or annualized rates of return if one did that?).

Thanks
Ted

On Fri, Jul 3, 2009 at 6:39 PM, Mark Knecht markkne...@gmail.com wrote:
> Could you supply some portion of the results when you run the example on your data? The example goes like:
>
>     data(edhec)
>     R=edhec[,"Funds.of.Funds"]
>     findDrawdowns(R)
>     sortDrawdowns(findDrawdowns(R))
>
> How are you using the function with your data?
>
> - Mark
Re: [R] The time series analysis functions/packages don't seem to like my data
Hi David,

Thanks for replying.

On Fri, Jul 3, 2009 at 8:08 PM, David Winsemius dwinsem...@comcast.net wrote:
> On Jul 3, 2009, at 7:34 PM, Ted Byers wrote:
>> x = as.timeseries(thedata)
>> Error: could not find function "as.timeseries"
>
> That is not telling you that there is no such function but rather that you have not loaded the package that contains it. To find out what package (which you have installed on your machine) contains a function, you type one of these equivalents:
>
>     ??as.timeseries
>     help.search("as.timeseries")

I did this, which is why I tried as.timeseries in the first place.

> If the needed package is not installed on your machine then you need to use one of the R search sites. I use: http://search.r-project.org/nmz.html
>
> In my installation there is a function named as.timeSeries in the package timeSeries. Not sure if that is the function you want. (Spelling must be exact in R.) If it is, then try:
>
>     library(timeSeries)

timeSeries was already installed. And using library(timeSeries) succeeds but does not help.

>> x
>> Error: object 'x' not found
>
> Not surprising, since the effort to create x failed.

Right, I wasn't surprised by this.

>> table.Drawdowns(thedata,top = 10)
>> Error in 1 + na.omit(x) : non-numeric argument to binary operator
>
> Not sure whether this is due to earlier errors or something that is wrong with your data. Most probably the latter, and since you have not reduced it to a reproducible example, no one can tell from a distance. If you were expecting the operation of giving thedata to as.timeseries() to have a lasting effect on thedata, you need to re-read the introductory material on R that is readily available. That's not how the language works.

The only thing missing from my example is the data file itself. I have no problems providing that too, but I didn't think that was permitted (and it is too large to embed within a message). No, I did not expect thedata to be modified by as.timeseries. I just thought I'd try to see if table.Drawdowns would accept a data frame. And my call to table.Drawdowns(thedata$V2, top = 10) was to see if it would even accept a numeric vector (which is what I'd expected the price data to be represented as).

Thanks
Ted
Re: [R] The time series analysis functions/packages don't seem to like my data
Hi Gabor,

Thanks.

On Fri, Jul 3, 2009 at 8:25 PM, Gabor Grothendieck ggrothendi...@gmail.com wrote:
> # 1. You can directly read your data into a zoo series like this:
>
>     Lines <- "8190 2009-06-16 49.30
>     8191 2009-06-17 48.40
>     8192 2009-06-18 47.72
>     8193 2009-06-19 48.83
>     8194 2009-06-22 46.85
>     8195 2009-06-23 47.11
>     8196 2009-06-24 46.97
>     8197 2009-06-25 47.43"

OK. Now I have to read up on zoo too. I was going to get to that, as I saw it mentioned in a couple of views related to analyzing financial data.

I apologize if this is a naive question, but if I am reading my data successfully using:

    thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header = FALSE, na.strings = "")

can my thedata be used in the same way as your Lines? Or would that be a different function call? What is your Lines anyway: a vector containing a series of strings? A matrix of strings? One long string distributed over a series of lines?

>     library(zoo)
>     z <- read.zoo(textConnection(Lines), index = 2)
>
> # and from that you can readily convert it to
> # other time series formats if need be.
>
> # 2. Read ?table.Drawdowns. It asks for __returns__, not raw
> # data as input.

OOPS, so I'll need an extra step. It is trivial to convert my data to daily deltas. I was more concerned at the moment with just getting my time series data into a form the time series functions require.

Thank you. This is quite useful.

Cheers
Ted

>     library(PerformanceAnalytics)
>     table.Drawdowns(diff(log(z$V3)))
>
> That gives me an error, and looking into it, it seems likely that table.Drawdowns fails when there is only one drawdown. library(help = PerformanceAnalytics) will give you the author's email address to whom you can report the problem.
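On the question of what Lines is: in the quoted example it is a single character string with embedded newlines, and textConnection() lets a reader function treat that string as if it were a file. The sketch below shows the same idea with base R only (zoo is not needed to run it) and then takes the log returns that the reply computes with diff(log(...)). The column names row, date and price are illustrative.

```r
# Sketch: one multi-line string read as if it were a file, then log returns.
# Column names are illustrative; zoo is not required for this version.
Lines <- "8190 2009-06-16 49.30
8191 2009-06-17 48.40
8192 2009-06-18 47.72
8193 2009-06-19 48.83
8194 2009-06-22 46.85
8195 2009-06-23 47.11
8196 2009-06-24 46.97
8197 2009-06-25 47.43"

length(Lines)  # a single string, not a vector of lines

prices <- read.table(text = Lines,
                     col.names = c("row", "date", "price"),
                     stringsAsFactors = FALSE)
prices$date <- as.Date(prices$date)

# Log returns: one fewer value than there are prices, labelled by the
# date on which each return is realised.
log_ret <- diff(log(prices$price))
names(log_ret) <- format(prices$date[-1])
log_ret
```

A data frame read from a file with read.csv is already past the parsing step, so it would be converted rather than passed through textConnection(); the string form is mainly a convenience for posting self-contained examples.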
Re: [R] The time series analysis functions/packages don't seem to like my data
Hi Mark,

Thanks. Your example works fine. But I see you're struggling with the same issue that I am. I also see that the format of the dates in the dataset you use in your example is the same format that my dates are in.

I just read it, so I haven't had a chance to investigate, but you might take a look at Gabor's response to me to see if read.zoo can help move data from a file (or whatever is returned by read.csv) into a zoo series.

Cheers,
Ted
Re: [R] The time series analysis functions/packages don't seem to like my data
Sorry, I should have read the read.zoo documentation before replying to thank Gabor for his response. Here is how it starts:

    read.zoo(zoo)                R Documentation

    Reading and Writing zoo Series

    Description

    read.zoo and write.zoo are convenience functions for reading and writing zoo series from/to text files. They are convenience interfaces to read.table and write.table, respectively.

    Usage

    read.zoo(file, format = "", tz = "", FUN = NULL, regular = FALSE, index.column = 1, aggregate = FALSE, ...)

Clearly this should solve both our problems.

Cheers,
Ted
Re: [R] The time series analysis functions/packages don't seem to like my data
On Fri, Jul 3, 2009 at 9:05 PM, Mark Knecht markkne...@gmail.com wrote:
On Fri, Jul 3, 2009 at 5:54 PM, Ted Byers r.ted.by...@gmail.com wrote:

Sorry, I should have read the read.zoo documentation before replying to thank Gabor for his response. Here is how it starts:

read.zoo(zoo) R Documentation

Reading and Writing zoo Series

Description

read.zoo and write.zoo are convenience functions for reading and writing zoo series from/to text files. They are convenience interfaces to read.table and write.table, respectively.

Usage

read.zoo(file, format = "", tz = "", FUN = NULL, regular = FALSE, index.column = 1, aggregate = FALSE, ...)

Clearly this should solve both our problems.

Cheers, Ted

Possibly, but I think the big issue is that the findDrawdowns function is looking for minus signs to signal the drawdown. I don't think it's doing calculations from a simple equity curve. All of these functions (findDrawdowns, table.Drawdowns, etc.) say they will accept a data.frame. My guess is the issue isn't so much dates, names, or anything else as much as making sure you have a column of percentage rise and fall numbers expressed like

0.03
0.02
-0.025
0.10

But this is trivial. I have to read the documentation further to see if it wants rates of return as a fraction (or percentages), or if daily deltas will do. Either way, it is trivial to get such numbers (in my case, in the Perl script I use to draw the data from my database).

Even findDrawdowns(edhec[,5]) does the right thing. Copying it to R wasn't necessary. edhec has lots of columns. You can pick any one of them and get a table.

This is good to know as it makes some of the analyses I need to do easier. I can create a single file with a number of series that need to be compared WRT drawdowns, VaR, &c.

Cheers, Ted

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
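To make the distinction between an equity curve and a column of fractional returns concrete, here is a small base R sketch that goes from the four example returns above to a drawdown series. It is only an illustration of the arithmetic; it is not how findDrawdowns or table.Drawdowns are implemented, and the variable names are made up for the example.

```r
# Fractional returns, as in the example column above.
ret <- c(0.03, 0.02, -0.025, 0.10)

# Equity curve implied by compounding the returns from a starting value of 1.
equity <- cumprod(1 + ret)

# Running peak of the equity curve, and the fractional drop from that peak.
peak <- cummax(equity)
drawdown <- equity / peak - 1

print(data.frame(ret, equity, peak, drawdown))
```

A drawdown of zero means the series is at a new high; the only negative entry here is the period with the -0.025 return.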
[R] Where can I find information on how to subsample a time series?
I suspect I'm looking in the wrong places, so guidance to the relevant documentation would be as welcome as a little code snippet.

I have time series data stored in a MySQL database. There is the usual DATE field, along with a double precision number: there are daily values (including only normal working days: Monday through Friday).

I actually have to do a couple of things here. Because of how the result is to be used, I need to first create two time series. The first is the delta between 22 working days, and the second is the delta between 66 working days. I have hundreds of these datasets, and some go back 30 years.

I need to estimate the correlation between 22 day deltas (i.e. is the delta for one month correlated with that of the previous month) and between the 22 day delta and the 66 day delta that ends the day before the first day of the 22 day delta. However, I KNOW the statistical properties of the time series are not constant (so the usual assumptions do not apply to the entire series). Therefore, I want to subsample finely enough to get a reasonably sensible correlation and examine how that changes through time. (There are no tests of significance here: I just want to explore just how much the properties of these series change through time.)

I have C++ code, admittedly not written particularly efficiently, that does this. The question is, is it possible to do this reasonably efficiently using R?

Thanks

Ted
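A rough base R sketch of the calculation described above, on a simulated series standing in for one dataset. The alignment of the 66 day delta (taken here as the one ending on the day the 22 day delta starts), the window of 24 monthly observations, and all names are assumptions made for illustration; they would need adjusting to the exact convention wanted.

```r
set.seed(1)
n <- 2000                      # hypothetical number of working days
x <- cumsum(rnorm(n))          # simulated stand-in for one price series

# k-day delta x[t] - x[t - k], padded with NA so it lines up with x.
delta <- function(v, k) c(rep(NA, k), diff(v, lag = k))
d22 <- delta(x, 22)
d66 <- delta(x, 66)

# Non-overlapping monthly sample points (every 22nd working day),
# starting late enough that every lagged delta is defined.
idx   <- seq(from = 89, to = n, by = 22)
cur   <- d22[idx]              # this month's 22-day delta
prev  <- d22[idx - 22]         # previous month's 22-day delta
prev3 <- d66[idx - 22]         # 66-day delta ending where this month starts

# Correlation over sliding windows of w consecutive monthly observations.
roll_cor <- function(a, b, w) {
  starts <- seq_len(length(a) - w + 1)
  vapply(starts, function(s) cor(a[s:(s + w - 1)], b[s:(s + w - 1)]), numeric(1))
}

w <- 24                        # assumed window: two years of monthly deltas
r_month   <- roll_cor(cur, prev, w)
r_quarter <- roll_cor(cur, prev3, w)
summary(r_month)
```

Plotting r_month and r_quarter against the date of each window's last observation would show how the correlations drift through time.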
Re: [R] Mystery Error in midnightStandard
Hi Yohan, Thanks. On Wed, Jan 28, 2009 at 4:57 AM, Yohan Chalabi chal...@phys.ethz.ch wrote: TB == Ted Byers r.ted.by...@gmail.com on Tue, 27 Jan 2009 16:00:27 -0500 TB I wasn't even aware I was using midnightStandard. You won't TB find it in my TB script. TB TB Here is the relevant loop: TB TB date1 = timeDate(charvec = Sys.Date(), format = %Y-%m-%d) TB date1 TB dow = 3; TB for (i in 1:length(V4) ) { TB x = read.csv(as.character(V4[[i]]), header = FALSE, TB na.strings=); TB y = x[,1]; TB year = V2[[i]]; TB week = V3[[i]]; TB dtstr = sprintf(%i-%i-%i,year,week,dow); TB date2 = timeDate(dtstr, format = %Y-%U-%w); TB resultsdataframe[[i]] - difftimeDate(date1,date2,units = TB weeks); TB fp = fitdistr(y,exponential); TB print(c(V1[[i]],V2[[i]],V3[[i]],fp,fp)); TB print(c(year,week,date2,resultsdataframe[[i]])); TB resultsdataframe[[i]] - fp; TB resultsdataframe[[i]] - fp; TB } TB TB It fails with a little more than 100 records left in V4. TB TB The full error message is: TB TB Error in midnightStandard(charvec, format) : TB 'charvec' has non-NA entries of different number of characters timeDate() uses the midnight standard. The function 'midnightStandard' assumes that all entries in 'charvec' have the same 'format'. Can you please check if this is the case? It is certain that all entries have the same format, but I'm starting to think that the error message is something of a red herring. 
Consider this:

year = 2009
week = 0
day = 3
datestr = sprintf("%i-%i-%i",year,week,day);datestr
[1] "2009-0-3"
date1 = timeDate(datestr, format = "%Y-%U-%w"); date1
GMT
[1] [NA]
day = 4
datestr = sprintf("%i-%i-%i",year,week,day);datestr
[1] "2009-0-4"
date1 = timeDate(datestr, format = "%Y-%U-%w"); date1
GMT
[1] [2009-01-01]
datestr = sprintf("%i-%i-%i",year,week,3);datestr
[1] "2009-0-3"
date2 = timeDate(datestr, format = "%Y-%U-%w");date2
GMT
[1] [NA]
difftimeDate(date2,date1, units = "weeks")
Error in midnightStandard(charvec, format) :
'charvec' has non-NA entries of different number of characters
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

The first values for year, week and day are the values on which my loop dies. It returns 'NA' here. It seems clear that it is returning NA because the date that data corresponds to is 2008-12-31.

The error is being produced by difftimeDate rather than timeDate (as shown by the above session). But that represents a flaw in the function design. It should fail when taking the elapsed time between a null and the present, but if I wrote such a function, I'd have it return null (perhaps with a warning) rather than just die. A bigger issue is that timeDate ought never give null here (which is what I assume 'NA' means), since all the data comes from transaction data with real dates, so the elapsed time, measured in weeks, ought to always be a valid real number that is positive semidefinite. I have not yet come to any conclusions as to how it ought to behave (whether to return New Year's Day, along with a warning, or to return the date requested by reinvoking itself with the year and week adjusted so a valid date is returned).

On a practical side, how would I test date2 to see if it is null, so I can give it a sensible default value?
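One possible way to test for the missing date and substitute a default, sketched with base R date classes rather than timeDate (is.na works on Date, POSIXct and POSIXlt values). The helper name week_to_date and the choice of 1 January as the fallback are assumptions for illustration; how strptime treats week 0 may differ between platforms, so no particular parsed value is shown.

```r
# Hypothetical helper: build a date from year, week of year (%U) and weekday (%w).
# Where parsing yields NA, fall back to 1 January of that year.
week_to_date <- function(year, week, dow) {
  s <- sprintf("%d-%d-%d", year, week, dow)
  d <- as.Date(strptime(s, format = "%Y-%U-%w", tz = "GMT"))
  fallback <- as.Date(sprintf("%d-01-01", year))
  bad <- is.na(d)                 # TRUE wherever the string did not map to a date
  d[bad] <- fallback[bad]
  d
}

date2 <- week_to_date(2009, 0, 3)   # the combination on which the loop dies
is.na(date2)                        # FALSE: either parsed, or replaced by the default
```

The same is.na test can guard the call that computes the elapsed time, so that a missing date produces a missing result instead of stopping the loop.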
A more troubling thought is that with this handling of dates in this combination of SQL (my group by clause uses YEAR(transaction_date),WEEK(transaction_date)) to get the data and R to process it, the week containing new years day will ALWAYS be split in two at the first second of the new year. I'm going to have to either figure out a way to correct this, or ignore it (as it doesn't actually make things wrong, but rather it splits a sample into two unequal parts).

Thoughts?

Thanks

Ted
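One way around the split week, sketched in base R: number weeks continuously from a fixed Monday instead of restarting at each new year. The epoch 1970-01-05 (a Monday) and the variable names are choices made for this example. On the SQL side, grouping by a single year-and-week key such as MySQL's YEARWEEK() may achieve something similar, but that is an untested suggestion.

```r
# Dates straddling the 2008/2009 boundary (Monday 2008-12-29 to Monday 2009-01-05).
dates <- as.Date(c("2008-12-29", "2008-12-31", "2009-01-01", "2009-01-04", "2009-01-05"))

# Count whole weeks since a fixed Monday; the count never resets at new year.
epoch_monday <- as.Date("1970-01-05")
week_index <- as.integer(dates - epoch_monday) %/% 7L

# The Monday that starts each date's week, handy as a group label.
week_start <- epoch_monday + 7L * week_index

print(data.frame(dates, week_index, week_start))
```

Here the first four dates fall in one bin and the last date starts the next one, so the week containing New Year's Day stays whole.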
Re: [R] Mystery Error in midnightStandard
Hi Yohan, On Wed, Jan 28, 2009 at 10:28 AM, Yohan Chalabi chal...@phys.ethz.chwrote: TB == Ted Byers r.ted.by...@gmail.com on Wed, 28 Jan 2009 09:30:58 -0500 TB It is certain that all entries have the same format, but I'm TB starting to TB think that the error message is something of a red herring. TB Consider this: TB TB year = 2009 TB week = 0 TB day = 3 TB datestr = sprintf(%i-%i-%i,year,week,day);datestr TB [1] 2009-0-3 TB date1 = timeDate(datestr, format = %Y-%U-%w); TB date1 TB GMT TB [1] [NA] TB day = 4 TB datestr = sprintf(%i-%i-%i,year,week,day);datestr TB [1] 2009-0-4 TB date1 = timeDate(datestr, format = %Y-%U-%w); TB date1 TB GMT TB [1] [2009-01-01] TB TB datestr = sprintf(%i-%i-%i,year,week,3);datestr TB [1] 2009-0-3 TB date2 = timeDate(datestr, format = %Y-%U-%w);date2 TB GMT TB [1] [NA] TB difftimeDate(date2,date1, units = weeks) TB Error in midnightStandard(charvec, format) : TB 'charvec' has non-NA entries of different number of characters TB In addition: Warning messages: TB 1: In min(x) : no non-missing arguments to min; returning Inf TB 2: In max(x) : no non-missing arguments to max; returning -Inf TB TB TB TB The first values for year, week and day are the values on TB which my loop TB dies. It returns 'NA' here. It seems clear that it is TB returning NA because TB the date that data corresponds to is 2008-12-31. TB TB The error is being produced by difftimeDate rather than timeDate TB (as shown TB by the above session). But that represents a flaw in the TB function design. This is not a flaw in timeDate. it behaves the same way as 'as.POSIXct' That the two behave the same doesn't change the assessment that the design is flawed. That doesn't mean that the function is wrong. It means only that the behaviour can be made more useful. For example, in SQL, if a given calculation returns NULL, and the result is subsequently used in another calculation, the result that returns is also NULL. 
That is quite useful, and admits algorithms that can react appropriately to NULLs when necessary. That is arguably better than forcing the code to fail the moment a NULL is used in a secondary calculation. In C++, OTOH, one can catch the problem earlier using, e.g., exceptions, again allowing the program to complete even when problems arise for certain values or combinations thereof. As a software engineer, I understand the issues involved in creating libraries. If I want to incorporate the functionality of a given standard suite of functions (e.g. ANSI C standard library functions, or posix functions), my first step would be to ensure I can duplicate how they behave. But I would not stop there. There are, for example, serious design flaws in many ANSI C functions that, ignored, introduce serious security defects in applications that use them. I would therefore refactor them to eliminate the security defects. If they can not be eliminated, I would replace the function in question by a similar function that does not have that security defect. Posix is a useful, but old, standard, and I am merely suggesting that once you have duplicated it, look beyond it to ways it can be improved upon. There is more to the design of a function than whether or not it gives the right result with good input. There is how it behaves when there is a problem with the inputs and whether or not you force the calling code to die when a problem arises or you give the calling code a way to react to such problems. When I add functions to my own C++ or Java libraries, I normally include more bad input data in the unit tests than good data (though the latter is sufficient to ensure correct results are invariably obtained), precisely so I can document how it behaves when there is a problem and give coders who use it a variety of options to use to deal with them. 
strptime(datestr, format = %Y-%U-%w) Instead of claiming that there is a flaw in the function you could have suggested an 'is.na' method for 'timeDate'. At the time, I did not know about is.na. I have spent the past hour trying is.na, but to no avail. I guess that is no surprise to you, but that it would fail is not reflected in the R documentation of is.na. That mentions S3, but not S4. As I just recently started using R, I have not yet looked at what S3 and S4 are, so that is a few more hours of study before I get this problem solved. I will add an 'is.na' method in the dev version of 'timeDate'. Thanks. I'll benefit from that once it makes it into the production release. In the mean time, I need to find a way to make something similar now, in my script. Thanks Ted [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http
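For what it is worth, base R offers both behaviours discussed in this exchange: NA propagates through arithmetic much as NULL does in SQL, and tryCatch lets calling code recover from an error instead of dying. The sketch below uses base difftime rather than difftimeDate, and the wrapper name safe_weeks is made up for the example.

```r
# NA propagates through arithmetic instead of raising an error.
x <- c(10, NA, 30)
doubled <- x * 2                  # the middle element stays NA
total <- sum(x, na.rm = TRUE)     # missing values skipped explicitly

# Hypothetical wrapper: elapsed weeks, or NA if the computation fails.
safe_weeks <- function(later, earlier) {
  tryCatch(
    as.numeric(difftime(later, earlier, units = "weeks")),
    error = function(e) {
      message("recovered from: ", conditionMessage(e))
      NA_real_
    }
  )
}

ok     <- safe_weeks(as.Date("2009-01-28"), as.Date("2009-01-14"))
missed <- safe_weeks(as.Date("2009-01-28"), as.Date(NA))      # NA in, NA out
failed <- safe_weeks(as.Date("2009-01-28"), "not a date")     # error caught
```

With a wrapper like this a loop over many samples can record NA for the problem cases and carry on, leaving the decision about what to do with them until afterwards.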
[R] Can I create a timeDate object using only year and week of the year values?
For a model I am working on, I have samples organized by year and week of the year. For this model, the data (year and week) comes from the basic sample data, but I require a value representing the amount of time since the sample was taken (actually, for the purpose of the model, it is sufficient to use the number of weeks from the middle of the sample week to the present).

What I have found so far includes:

library(Rmetrics)
time1 = timeDate(charvec = Sys.Date(), format = "%Y-%m-%d", zone = "", FinCenter = "")
time2 = timeDate("2004-08-30", format = "%Y-%m-%d", zone = "", FinCenter = "")
difftimeDate(time1, time2, units = "weeks")

Does timeDate use the format strings used by the UNIX date(1) command? If so, then can I safely assume timeDate will accept "%Y-%U-%w", and behave correctly?

Thanks,

Ted
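For comparison, the same elapsed-time calculation can be sketched with base R alone. The fixed "present" date is only there to make the example reproducible; Sys.Date() would be used in practice. The three-day offset to reach the middle of a week is an assumption about what "middle of the sample week" should mean.

```r
sample_date <- as.Date("2004-08-30")
now <- as.Date("2009-01-27")     # fixed stand-in for Sys.Date()

# Elapsed time in weeks as a plain number.
weeks_elapsed <- as.numeric(difftime(now, sample_date, units = "weeks"))

# Measuring from the middle of the sample week instead of its first day
# (assumes sample_date marks the start of the week).
mid_week <- sample_date + 3
weeks_from_mid <- as.numeric(difftime(now, mid_week, units = "weeks"))

# The strptime-style codes also work for output.
label <- format(sample_date, "%Y-%U-%w")
```

The format codes accepted here are the ones listed under ?strptime.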
Re: [R] Can I create a timeDate object using only year and week of the year values?
Thanks Patrick.

On Tue, Jan 27, 2009 at 2:03 PM, Patrick Connolly p_conno...@slingshot.co.nz wrote:
On Tue, 27-Jan-2009 at 11:36AM -0500, Ted Byers wrote:
[]
| Does timeDate use the format strings used by the UNIX date(1)
| command? If so, then can I safely assume timeDate will accept
| %Y-%U-%w, and behave correctly?

Your chances are good. To be sure, check out ?strptime

HTH

According to ?strptime, the answer is yes; something I have confirmed with limited trials.

--
Patrick Connolly
Great minds discuss ideas
Average minds discuss events
Small minds discuss people
. Eleanor Roosevelt

Smart lady! Too bad there are no great minds in power in these economically interesting times.

Thanks

Ted
[R] Mystery Error in midnightStandard
I wasn't even aware I was using midnightStandard. You won't find it in my script.

Here is the relevant loop:

date1 = timeDate(charvec = Sys.Date(), format = "%Y-%m-%d")
date1
dow = 3;
for (i in 1:length(V4) ) {
  x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings="");
  y = x[,1];
  year = V2[[i]];
  week = V3[[i]];
  dtstr = sprintf("%i-%i-%i",year,week,dow);
  date2 = timeDate(dtstr, format = "%Y-%U-%w");
  resultsdataframe$dt[[i]] <- difftimeDate(date1,date2,units = "weeks");
  fp = fitdistr(y,"exponential");
  print(c(V1[[i]],V2[[i]],V3[[i]],fp$estimate,fp$sd));
  print(c(year,week,date2,resultsdataframe$dt[[i]]));
  resultsdataframe$estimate[[i]] <- fp$estimate;
  resultsdataframe$sd[[i]] <- fp$sd;
}

It fails with a little more than 100 records left in V4.

The full error message is:

Error in midnightStandard(charvec, format) :
'charvec' has non-NA entries of different number of characters

Until it fails, date2 and resultsdataframe$dt[[i]] get correct values. str() produces no surprises:

str(resultsdataframe);
'data.frame': 303 obs. of 6 variables:
 $ mid     : int 171 206 206 206 206 206 206 206 206 218 ...
 $ year    : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ week    : int 16 17 18 19 21 26 31 35 51 40 ...
 $ dt      : num 39.9 38.9 37.9 36.9 34.9 ...
 $ estimate: num Inf 0.25 Inf 0.0408 0.2 ...
 $ sd      : num Inf 0.1768 Inf 0.0289 0.1414 ...

I would assume the error is related to my new code that manipulates dates, as it doesn't occur in the earlier version that did not manipulate dates (the relevant work being done, albeit very slowly, within the DB).

FTR: The year and week values are generated by MySQL using the YEAR and WEEK functions applied to timestamps. I do not know if it is relevant, but the week value, at the point of failure, is 0 (a value that does not occur earlier in the dataset, but several times subsequently), and I do not see how a value of 0 for the week (legitimate in POSIX date formats) could produce the error message I get.

Any thoughts on what is really wrong, and how to fix it?

Thanks

Ted
Re: [R] Staging area for data before read into R
There are tradeoffs no matter what route you take. I worked on a project a few years ago, repairing an MS Access DB that had been constructed, data entry forms and all, by one of the consulting engineers. They supported that development because they found that even with all the power and utility of Excel, in support of data entry, errors were still much too common, requiring significant time and therefore money to address. The more data you have, the more costly it is to find and repair the data when something goes awry. Their problem was that it was put together in haste by a man who knew nothing about RDBMS. He was learning as he went. While what he produced was adequate for a single site with no turnover in staff, and would have been fine if it was intended for his own use in his consulting practice, it would inevitably have broken the moment the service was extended to more than one site or the moment there was any turnover in staff at all. The client he delivered it to was a large mining company that wanted to deploy it on all their mines and ore processing facilities. Yes the error rate in data entry went way down, but the mistake was in trying to deliver a software product to a client without the input of an experienced software engineer. You can do validation in Access as you can in Excel, but Excel is not designed to manage data where Access is, and both are crippled by their dependance on VB (a seriouusly broken language: fine for scripting MS Office, but not what you want to develop a real application - not that the OP wants that anyway). I don't want to beat on Excel, as it is a useful tool when used for what it is designed for; and others have pointed out some hazards when using it. Dr. Snow is right in recommending going the route of using an RDBMS and in saying that it isn't that hard to get started. 
I'd be recommending PostgreSQL, though, since it is relatively easy to use, and it has pl/r (which lets you run R code within stored procedures in the DB) which carries obvious advantages. The bottom line is that the best option depends on your objectives and what you need to do. You can use Excel quite effectively if you are careful and know what you're doing. If you are going to manage data that will require significant effort to enter, you may want something a bit more robust and better designed to manage your data. If you are going to deliver services based on your software, you need a software engineer to ensure it doesn't break on your client (that could be quite costly). Since the OP is apparently using it for his own purposes, and unlikely to be selling the data or services based on it, the services of an engineer aren't needed, though they can be useful if there are concerns about administering the DB so as to guarantee the security of the data. I could tell you tales of how data that cost millions of dollars to collect were almost lost because a consultant was careless in this regard and made mistakes in handling the data. Fortunately, recovery was quick in these cases because my colleagues were diligent in maintaining backups. But you get the point. Murphy's law says that whatever can go wrong, will. There are plenty of options, and the OP will need to do what he's most confortable doing. If I were in his place, I'd say my data is sacred, and can not be replaced (just as you can't step into the same stream twice); and therefore I'd use a RDBMS to manage it, and the very moment it is all entered, I'd make a backup of both the data (e.g. in MySQL I'd use mysqldump) AND the software, and copy both backups to two CDs or DVDs. 
And, if the data were originally recorded on paper, I'd be scanning the pages and copying those images onto a couple CDs or DVDs also: with two copies on optical media, one copy can be stored in a fireproof vault while the other is in the office ready to be used should a HDD fail, or some other disaster interrupt my work. OK, so I'm paranoid about my data, but I'd rather go the extra mile than risk losing it. Cheers, Ted Gabor Grothendieck wrote: Excel has a data validation facility and also has data input forms to facilitate data entry. On Tue, Oct 21, 2008 at 1:45 PM, Greg Snow [EMAIL PROTECTED] wrote: Stephen, One of the big problems with spreadsheets (other than the column limit in some) is that the standard entry mode allows too much flexibility which does nothing to help you avoid data entry errors. The Webpage: http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html has some examples of this going wrong, including one that happened to my group where the column for dates was not preformatted, the dates were entered using European format, and Excel did 2 different wrong things with them making it very difficult to do anything with the data without major extra work. If you are going to stick with a spreadsheet, then at a minimum you should start by naming all your columns, then formatting each
Re: [R] Staging area for data before read into R
I wasn't suggesting that the validation requires VB. Creating forms and handling form events does (unless MS has introduced new utilities to hide all that since last I used it). Some of the most interesting things I have seen done with Excel did involve VB, and there are better tools to do most of those things. Gabor Grothendieck wrote: On Tue, Oct 21, 2008 at 3:18 PM, Ted Byers [EMAIL PROTECTED] wrote: There are tradeoffs no matter what route you take. You can do validation in Access as you can in Excel, but Excel is not designed to manage data where Access is, and both are crippled by their dependance on VB (a seriouusly broken language: fine for scripting MS Excel can do validation without VB. For example, you can restrict data to a certain range of dates, limit choices by using a list, or make sure that only positive whole numbers are entered all without any VB. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- View this message in context: http://www.nabble.com/Staging-area-for-data-before-read-into-R-tp20075962p20099445.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Staging area for data before read into R
No. Excel, like most spreadsheets, does what it was designed for reasonably well. It is easy to find fault, but not so easy to satisfy all one's critics.

There is no doubt that Excel has faults, but it provides significant modelling and analysis capability to users with no programming expertise or limited experience using IT. I have used it as a teaching tool for very basic modelling to undergraduate students who would not have been able to do any modelling without it. In a one session course, there just isn't time to teach students enough programming in any language for them to have a hope of producing an interesting model. But they can produce an interesting model with some guidance using Excel. Similarly, they can do elementary data analysis by entering their data into Excel and using it to analyse it.

Excel was designed primarily for business people, and I have seen them use it effectively, doing things I don't fully understand (as I am not a businessman). But these same people would go into a catatonic state the moment a discussion becomes technical or mathematical. They describe Excel as powerful, and until I become an expert MBA type, I won't knock them for that. If they find it useful, why would I argue with them?

Don't get me wrong, I do not normally use it, and for 99% of the work I do, it provides no value to me, so I do not have it installed on my own systems. I am better served by C++, Java, and the related tools specific to my work. But that it isn't useful to me, or apparently you, is not sufficient grounds to question its utility for others (neither is the existence of bugs, as ALL software has bugs: MS makes for an easy target, but I try to be as fair to them as I am to an independent developer who works alone - let's not have this degenerate into an attack on MS, please).

As a software engineer myself, I won't knock the work of another just because what he's produced isn't particularly useful for me.
I won't even knock him if I don't agree with the design decisions he's made. When that happens, it is likely I was not part of his intended market: nothing more can be implied. Rolf Turner-3 wrote: On 22/10/2008, at 8:18 AM, Ted Byers wrote: snip ... even with all the power and utility of Excel ... snip Is this some kind of joke? cheers, Rolf Turner ## Attention:\ This e-mail message is privileged and confid...{{dropped:9}} __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- View this message in context: http://www.nabble.com/Staging-area-for-data-before-read-into-R-tp20075962p20099848.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Staging area for data before read into R
Ah, OK. That is new since I used Excel last. Thanks On Tue, Oct 21, 2008 at 5:52 PM, Gabor Grothendieck [EMAIL PROTECTED] wrote: You can create data entry forms without VB in Excel too. On Tue, Oct 21, 2008 at 5:09 PM, Ted Byers [EMAIL PROTECTED] wrote: I wasn't suggesting that the validation requires VB. Creating forms and handling form events does (unless MS has introduced new utilities to hide all that since last I used it). Some of the most interesting things I have seen done with Excel did involve VB, and there are better tools to do most of those things. Gabor Grothendieck wrote: On Tue, Oct 21, 2008 at 3:18 PM, Ted Byers [EMAIL PROTECTED] wrote: There are tradeoffs no matter what route you take. You can do validation in Access as you can in Excel, but Excel is not designed to manage data where Access is, and both are crippled by their dependance on VB (a seriouusly broken language: fine for scripting MS Excel can do validation without VB. For example, you can restrict data to a certain range of dates, limit choices by using a list, or make sure that only positive whole numbers are entered all without any VB. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- View this message in context: http://www.nabble.com/Staging-area-for-data-before-read-into-R-tp20075962p20099445.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] How to get estimate of confidence interval?
I thought I was finished, having gotten everything to work as intended. This is a model of risk, and the short term forecasts look very good, given the data collected after the estimates are produced (this model is intended to be executed daily, to give a continuing picture of our risk). But now there is a new requirement.

I have weekly samples from a non-autonomous process (i.e. although well modelled as a decay process, with an exponential distribution fitting the decay times well, the rate estimates and their sd vary considerably from one week to the next). The total number of events to be expected from a given sample over the next week can be easily estimated from a simple integral. And the total number of these events from all samples is just the sum of these estimates over all samples. So far, so good (imagine you have a sample of a variety of species of radionuclides all emitting alpha particles with the same energy - so you can't tell from the decay event which species produced the alpha particles).

I guess there are two parts to my question. I get a fit of the exponential distribution to each sample using fitdistr(x, "exponential"). I am finding the expected values vary by as much as a factor of 4, and the corresponding estimates of sd vary by as much as a factor of 100 (some samples are MUCH larger than others).

How do I go from the sd it gives to a 99% confidence interval for the integral for that function from now through a week from now (or to the end of time, or through the next month/quarter)?

And how do I move from these estimates to get the expected value and confidence intervals for the totals over all the samples?

I am a bit rusty on figuring out how error propagates through model calculations (an online reference for this would be handy, if you know of one).

Thanks

Ted
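A sketch of one standard route, under assumptions that go beyond what is stated above: each sample i has a known number N_i of items still outstanding, the expected number of events over a horizon T is N_i * (1 - exp(-rate_i * T)), and the samples are independent. The rate and its sd are computed in closed form (1/mean and rate/sqrt(n), which should match what the exponential fit reports); the delta method then carries the sd through the integral, and an exact chi-square interval for the rate is mapped through the same function. All data and names below are simulated or hypothetical.

```r
set.seed(42)
# Hypothetical samples of decay times (weeks) and items still outstanding.
samples <- list(rexp(40, rate = 0.25), rexp(400, rate = 0.05), rexp(15, rate = 0.2))
at_risk <- c(120, 900, 30)
horizon <- 1                       # forecast window, in weeks
level   <- 0.99
alpha   <- 1 - level
z       <- qnorm(1 - alpha / 2)

fit_one <- function(x, N) {
  n <- length(x)
  rate    <- 1 / mean(x)                         # MLE of the exponential rate
  sd_rate <- rate / sqrt(n)                      # its asymptotic standard error
  expected <- N * (1 - exp(-rate * horizon))     # assumed form of the integral
  # Delta method: sd of g(rate) is |g'(rate)| * sd(rate).
  sd_expected <- N * horizon * exp(-rate * horizon) * sd_rate
  # Exact interval for the rate (2 * rate * sum(x) is chi-square on 2n df),
  # mapped through the monotone function g.
  rate_ci <- qchisq(c(alpha / 2, 1 - alpha / 2), df = 2 * n) / (2 * sum(x))
  ci <- N * (1 - exp(-rate_ci * horizon))
  c(rate = rate, sd_rate = sd_rate, expected = expected,
    sd_expected = sd_expected, lower = ci[1], upper = ci[2])
}

per_sample <- t(mapply(fit_one, samples, at_risk))

# Totals: expectations add; variances add if the samples are independent.
total    <- sum(per_sample[, "expected"])
total_sd <- sqrt(sum(per_sample[, "sd_expected"]^2))
total_ci <- total + c(-1, 1) * z * total_sd
print(per_sample)
print(c(total = total, lower = total_ci[1], upper = total_ci[2]))
```

These intervals reflect only the uncertainty in the fitted rates. The actual event counts also vary randomly around their expected values, so an interval for next week's observed total would be wider than this.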
Re: [R] Staging area for data before read into R
Define better. Really, it depends on what you need to do (are all your data appropriately represented in a 2D array?) and what resources are available. If all your data can be represented using a 2D array, then Excel is probably your best bet for the near term. If not, you might as well bite the bullet and learn to use an RDBMS, as there are few other data management options that can cope with relational or hierarchical or object oriented data.

I use a number of different RDBMS (ranging from MS SQL to PostgreSQL and MySQL). I also use Excel on occasion, and plain text editors (like Emacs), to create CSV files. Which I use depends on the details of the particular problem I am facing.

While I have not yet explored them, I did notice that R includes a number of facilities for editing data (and the list of options is all the longer when I use help.search("edit")). It may be a bit quicker for you to study up on basic use of something like PostgreSQL, combined with pl/r (something I wish MySQL had), than it would be to diligently examine all the different options open to you using R. (I have a couple of books I could recommend that would likely be sufficient for you to figure out what you need to do with either PostgreSQL or MySQL in a matter of a week or two.)

HTH

Ted

stephen sefick wrote:

I am wondering if there is a better alternative than Excel for data storage that does not require database knowledge (I will eventually have to learn this, but it is not on my immediate todo list). I need something that is not limited to 256 columns... I don't need any of the built in functions in excel just a spreadsheet like program with cells that hold data in a data.frame format for a staging area before I get it into R. Any help would be greatly appreciated. This is not a direct r question, but all of you folks have more experience than I do and I am having a time finding what I need with google.
thanks in advance -- Stephen Sefick Research Scientist Southeastern Natural Sciences Academy Let's not spend our time and resources thinking about things that are so little or so large that all they really do for us is puff us up and make us feel like gods. We are mammals, and have not exhausted the annoying little problems of being mammals. -K. Mullis
[R] dbApply questions/clarifications
In the example in the documentation, I see:

rs <- dbSendQuery(con, "select Agent, ip_addr, DATA from pseudo_data order by Agent")
out <- dbApply(rs, INDEX = "Agent", FUN = function(x, grp) quantile(x$DATA, names=FALSE))

Maybe I am a bit thick, but it took me a while, and a kind hint from Phil, to figure much of this out. It is clear that the SQL orders the data by Agent, and the INDEX parameter tells dbApply that FUN is to be applied to each group of values defined by Agent (like applying SUM(DATA) in SQL using a GROUP BY clause). If my understanding is correct, out will be an array holding ordered pairs, with the value of Agent and the corresponding values returned by FUN.

I take it FUN = function(x, grp) quantile(x$DATA, names=FALSE) is the function definition for a function called FUN. I would guess, then, that the opening and closing braces are optional. Is that correct? Or is this something else? I did not see a definition of 'grp'. What is it?

Suppose the function I want to apply is fitdistr(x,"exponential"). Would I just replace quantile(x$DATA, names=FALSE) by fitdistr(x,"exponential")?

Finally, suppose the query I need to run is more complex, such as:

SELECT group_id,YEAR(my_date),WEEK(my_date),ndays FROM myTable ORDER BY group_id,YEAR(my_date),WEEK(my_date);

Can dbApply handle applying fitdistr(x,"exponential") to each group of values defined by group_id,YEAR(my_date),WEEK(my_date)? If so, how would I change the call to dbSendQuery, and how would I insert the resulting estimates using something like INSERT INTO myResults (group_id,year,week,rate,sd) VALUES (?,?,?,?,?);? Once I get this, I can do everything else within a stored procedure in MySQL.

I get the idea of using, e.g., sprintf to interpolate values I need to insert into a query string, but it is a question of how to get the values I need from 'out' (to use the above example), and how to iterate over them to do the SQL INSERT. Actually, would 'dbWriteTable' handle inserting these values efficiently? If so, how do I ensure it maps the group_id, year, week, etc. from 'out' to the right columns in my results table (what I have in mind involves a table with a couple of extra columns that would take appropriate default values)?

Thanks

Ted
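The group-by-group fitting asked about here can also be sketched in plain R, without dbApply, once the rows are in a data frame. The following is only an illustration under stated assumptions: the data are simulated rather than fetched from MySQL, the column names `yr` and `wk` stand in for YEAR(my_date) and WEEK(my_date), and `fit_group` is a hypothetical helper, not part of RMySQL.

```r
library(MASS)  # provides fitdistr()

# Simulated stand-in for the rows a query such as
#   SELECT group_id, YEAR(my_date), WEEK(my_date), ndays FROM myTable ...
# might return; the values are made up for illustration.
set.seed(1)
dat <- data.frame(
  group_id = rep(c(1L, 2L), each = 40),
  yr       = 2008L,
  wk       = rep(c(18L, 19L), times = 40),
  ndays    = rexp(80, rate = 0.1)
)

# Hypothetical helper: fit an exponential distribution to one group and
# return a one-row data frame holding the keys and the estimates.
fit_group <- function(d) {
  fp <- fitdistr(d$ndays, "exponential")
  data.frame(group_id = d$group_id[1], yr = d$yr[1], wk = d$wk[1],
             rate = unname(fp$estimate), sd = unname(fp$sd))
}

# split() makes one data frame per (group_id, yr, wk) combination;
# drop = TRUE discards combinations that have no rows.
pieces  <- split(dat, list(dat$group_id, dat$yr, dat$wk), drop = TRUE)
results <- do.call(rbind, lapply(pieces, fit_group))
rownames(results) <- NULL
print(results)
```

Because each group is handled separately, the groups may have different sizes. The resulting `results` data frame has one row per group and could then be written out in one step rather than row by row.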
[R] Can R scripts executed in batch mode take a commandline argument?
I have examined the documentation for batch mode use of R:

R CMD BATCH [options] infile [outfile]

The documentation for this seems rather spartan. Running R CMD BATCH --help gives me info on only two options: one for getting help and the other to get the version. I see, further on, that there are options for restoring and saving sessions (which I do not need to do in this case), but are there other options defined? If so, what are they and how are they to be used?

However, it goes on to say: "Further arguments starting with a '-' are considered as options as long as '--' was not encountered, and are passed on to the R process, which by default is started with '--restore --save'."

So further arguments starting with a '-' are passed to the R process, but the usage is not clear. For example, suppose I write a script that should take as a commandline argument the name of a file containing a series of numbers. I want to place those numbers into a vector which, in turn, I want to pass to fitdistr(x,"exponential"). What do I do to get that file name from the commandline and pass it to, say, read.csv?

BTW: How would I tell it that there is no need to restore and save?

If I can't pass a commandline argument, do I have to write the arguments in a file, and have that file read each time I need to run the script?

Thanks

Ted
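One way to pick up a file name from the command line is base R's commandArgs(). The sketch below is an illustration only; the helper `get_input_file` and the script name `fit.R` used in the note afterwards are hypothetical, and the script simply reports a summary rather than fitting a distribution.

```r
# Hypothetical helper: pick the input file name out of the arguments that
# follow the script name (or that follow --args under R CMD BATCH).
get_input_file <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) < 1) {
    return(NA_character_)  # no file name was supplied
  }
  args[1]
}

infile <- get_input_file()
if (!is.na(infile) && file.exists(infile)) {
  # Read the single column of numbers and show a quick summary of it.
  x <- read.csv(infile, header = FALSE)
  print(summary(x[, 1]))
} else {
  message("usage: Rscript fit.R <datafile>")
}
```

Assuming the script is saved as fit.R, it could be run as `Rscript fit.R mydata.csv`, or in batch mode as `R CMD BATCH --no-save --no-restore "--args mydata.csv" fit.R`. The --no-save and --no-restore options address the question about skipping the save and restore steps; the exact quoting needed around --args may differ between shells.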
[R] Argh! Trouble using string data read from a file
Here is what I tried:

optdata = read.csv("K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat", header = FALSE, na.strings="")
optdata
attach(optdata)
for (i in 1:length(V4) ) { x = read.csv(V4[[i]], header = FALSE, na.strings="");x }

And here is the outcome (just a few of the 60 records successfully read):

> optdata = read.csv("K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat", header = FALSE, na.strings="")
> optdata
   V1   V2 V3                        V4
1 251 2008 18 Plus_Shipping.2008.18.dat
2 251 2008 19 Plus_Shipping.2008.19.dat
3 251 2008 20 Plus_Shipping.2008.20.dat
4 251 2008 22 Plus_Shipping.2008.22.dat
5 251 2008 23 Plus_Shipping.2008.23.dat
6 251 2008 24 Plus_Shipping.2008.24.dat

I can see the data has been correctly read. But for some reason that isn't clear, read.csv doesn't like the data in the last column.

> attach(optdata)
> for (i in 1:length(V4) ) { x = read.csv(V4[[i]], header = FALSE, na.strings="");x }
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  'file' must be a character string or connection
> V4[[1]]
[1] Plus_Shipping.2008.18.dat
60 Levels: Easyway.2008.17.dat Easyway.2008.18.dat Easyway.2008.19.dat Easyway.2008.20.dat ... Secured_Pay.2008.31.dat

The last column is comprised of valid Windows filenames (and no whitespace, so as not to confuse things). I see in the documentation "`[[...]]' is the operator used to select a single element, whereas `[...]' is a general subscripting operator.", so I assume V4[[i]] is the correct way to get the ith value from V4. So why does read.csv complain that 'file' must be a character string or connection? It seems obvious that the value in V4[[i]] is a string. V4[[1]] does give me the right value, although that is followed by output I didn't ask for.

In the loop above, I was going to replace the output obtained by 'x' with output from fitdistr(x,"exponential"), but I can't proceed with that until I can get the data in these files read. What have I missed?

Thanks

Ted
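The "60 Levels:" line in the output above indicates that V4 is a factor rather than a character vector, which is why read.csv rejects it. The sketch below reproduces the situation with temporary, made-up files (the real soptions.dat is not available here) and shows two possible ways around it. Note that stringsAsFactors defaults to FALSE from R 4.0.0 onward, so the example forces the older behaviour explicitly.

```r
# Write a small data file and an index file that names it, mimicking the
# layout of soptions.dat (the file names here are temporary and made up).
data_file  <- tempfile(fileext = ".dat")
index_file <- tempfile(fileext = ".dat")
writeLines(c("0", "21", "7"), data_file)
write.table(data.frame(251, 2008, 18, data_file), index_file,
            sep = ",", row.names = FALSE, col.names = FALSE)

# Force the old default so the last column becomes a factor.
idx <- read.csv(index_file, header = FALSE, stringsAsFactors = TRUE)
print(is.factor(idx$V4))        # the column is a factor
print(is.character(idx$V4[1]))  # a single element is still not a string

# Option 1: convert at the point of use.
x1 <- read.csv(as.character(idx$V4[1]), header = FALSE)

# Option 2: never create the factor in the first place.
idx2 <- read.csv(index_file, header = FALSE, stringsAsFactors = FALSE)
x2 <- read.csv(idx2$V4[1], header = FALSE)
```

Both `[` and `[[` return a factor when applied to a factor, so the choice of brackets does not change the outcome here; the conversion to character does.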
Re: [R] Argh! Trouble using string data read from a file
Actually, I'd tried single brackets first. Here is what I got:

> for (i in 1:length(V4) ) { x = read.csv(V4[i], header = FALSE, na.strings="");x }
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  'file' must be a character string or connection

The advice to use as.character worked, in that progress has been made. Can you guys explain the following output, though?

> setwd("K:\\MerchantData\\RiskModel\\AutomatedRiskModel")
> for (i in 1:length(V4) ) { x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings="");x }
> x
  V1
1  0
> x = read.csv(as.character(V4[[1]]), header = FALSE, na.strings="");x
   V1
1   0
2   0
3  21
4   0
5   1
6   7
7  51
8  20
9   3
10  5
11  6
12  8
13  2
14  0
15  2
16  4
17 23

Clearly, if I hand write a line to read the data, getting the file name from V4 (in this case V4[[1]]), I get the data into 'x', which I can then display. I only displayed the first few as some of these files will have thousands of values. But what puzzles me is that I saw virtually no output from my loop. I thought what would happen (with the x after the ';') is that the contents of each file would be displayed after it is read and before the next is read. And after the loop finishes, there is nothing in x. I don't see why the contents of x would disappear after the loop, unless R has scoping restrictions as stringent as, say, C++ (e.g. a variable declared inside a loop is not visible outside the loop). But that would beg the question as to how to declare a variable before it is first used.

This doesn't bode well for me, or perhaps my ability to learn a new trick at my age, when such a simple loop should give me such trouble. :-( Getting more grey hair by the minute. :-(

Thanks

Ted

On Wed, Oct 15, 2008 at 5:12 PM, Rolf Turner wrote:

On 16/10/2008, at 10:03 AM, jim holtman wrote:

try putting as.character in the call: x = read.csv(as.character(V4[[i]]), header = FALSE

No. This won't help. V4 is a column of the data frame optdata, and hence is a vector. Not a list! Use single brackets --- V4[i] --- and all will be well.

cheers, Rolf
Re: [R] Argh! Trouble using string data read from a file
Thanks Jim, I hadn't seen the distinction between the commandline in RGui and what happens within my code. I have, however, seen other differences I don't understand. For example, looking at the documentation for Rscript, I see:

Rscript [options] [-e expression] file [args]

And the example:

Rscript -e 'date()' -e 'format(Sys.time(), "%a %b %d %X %Y")'

So I tried it (Windows XP; R 2.7.2), and this is what I got by copying directly from the documentation and pasting into the Windows commandline window:

C:\>Rscript -e 'date()' -e 'format(Sys.time(), "%a %b %d %X %Y")'
[1] "date()"
C:\>Rscript -e 'format(Sys.time(), "%a %b %d %X %Y")'
C:\>

But within RGui, I get:

> date();format(Sys.time(), "%a %b %d %X %Y")
[1] "Wed Oct 15 20:36:57 2008"
[1] "Wed Oct 15 8:36:57 PM 2008"

Thanks again

Ted

On Wed, Oct 15, 2008 at 8:09 PM, jim holtman wrote:

You have to explicitly 'print' the value of x in the loop: print(x)

'x' by itself is just its value. At the command line, typing an object's name is equivalent to printing that object, but it only happens at the command line. If you want a value printed, then 'print' it. That also works at the command line if you want to use it there.
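The point about print() inside loops can be illustrated with a small sketch. The values are made up; nothing here depends on the data files from the thread.

```r
values <- c(10, 20, 30)

# Inside a loop a bare name is evaluated but not shown; print() is needed.
for (v in values) {
  x <- v * 2
  x          # evaluated and discarded: nothing appears
  print(x)   # shows the value
}

# The loop does not create a new scope, so x still holds the last value.
print(x)
```

This also bears on the scoping worry raised earlier in the thread: a variable assigned inside a for loop remains visible after the loop, holding whatever was assigned on the final iteration.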
[R] Two last questions: about output
Here is my little scriptlet:

optdata = read.csv("K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat", header = FALSE, na.strings="")
attach(optdata)
library(MASS)
setwd("K:\\MerchantData\\RiskModel\\AutomatedRiskModel")
for (i in 1:length(V4) ) {
  x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings="");
  y = x[,1];
  fp = fitdistr(y,"exponential");
  print(c(V1[[i]],V2[[i]],V3[[i]],fp$estimate,fp$sd))
}

And here are the first few lines of output:

rate rate
2.51e+02 2.008000e+03 1.80e+01 6.869301e-02 6.462095e-03
rate rate
2.51e+02 2.008000e+03 1.90e+01 5.958023e-02 4.491029e-03
rate rate
2.51e+02 2.008000e+03 2.00e+01 8.631714e-02 7.428996e-03
rate rate
2.51e+02 2.008000e+03 2.20e+01 1.261538e-01 1.137491e-02
rate rate
2.51e+02 2.008000e+03 2.30e+01 1.339523e-01 1.332875e-02
rate rate
2.51e+02 2.008000e+03 2.40e+01 8.916084e-02 1.248501e-02

There are only two things wrong here.

1) The first three columns are integers, and are output variously as integers, floating point numbers and, as shown here, in scientific notation.

2) This output isn't going to a file or to my DB.

This second issue isn't much of a problem, as I think I know now how to deal with it. This output data is, in one sense, perfectly organized, and there is a table with a nearly identical structure (these five columns, plus one to hold the date on which the analysis is performed, which of course has a default value of the current timestamp, handled in MySQL). If I can get the data written to a CSV file, with the first three columns provided as integers, I can use the DB's bulk load utility to get the data into the DB, and this may be faster than having this scriptlet connect directly to the DB to insert the data (unless the DBI has a function for a bulk load that helps here).

Any idea how best to handle my formatting problem here?

Thanks

Ted
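The scientific notation arises because c() combines the integers and the estimates into a single numeric vector, which is then printed with one common format. A possible way around this, sketched below with numbers taken from the output above, is to keep the values in a data frame, where each column has its own type. The temporary output file is an assumption made for the sake of a runnable example.

```r
# One row of results, using numbers taken from the output shown above.
res <- data.frame(mid = 251L, year = 2008L, week = 18L,
                  rate = 6.869301e-02, sd = 6.462095e-03)

# A data frame keeps a separate type per column, so integers stay integers.
out_file <- tempfile(fileext = ".csv")
write.csv(res, out_file, row.names = FALSE)
lines_written <- readLines(out_file)
print(lines_written)

# For screen output, sprintf() gives explicit control over each field.
line <- sprintf("%d,%d,%d,%.6f,%.6f",
                res$mid, res$year, res$week, res$rate, res$sd)
print(line)
```

A CSV file written this way has the first three fields as plain integers, which is the form a bulk loader would typically expect.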
Re: [R] Two last questions: about output
Thanks Gabor, I get how to make a frame using existing vectors. In my example, the following puts my first three columns into a frame (and displays it):

> testframe <- data.frame(mid=V1,year=V2,week=V3)
> testframe
  mid year week
1 251 2008   18
2 251 2008   19
3 251 2008   20
4 251 2008   22
5 251 2008   23
6 251 2008   24
7 251 2008   25

I show the first few of about 60 rows, and I am pleased that these values appear as integers.

But what I don't see is how to add the fp$estimate,fp$sd values obtained from my analyses to vectors to form the last two columns in the data frame. Is there something like a vector type, analogous to the vector class std::vector from C++, that has a push_back function allowing a vector to grow as new values are generated?

And suppose I have the following table in MySQL (ignoring for the moment keys and indices):

CREATE TABLE (
  id INTEGER UNSIGNED NOT NULL auto_increment,
  mid INTEGER NOT NULL,
  y INTEGER NOT NULL,
  w INTEGER NOT NULL,
  rate DOUBLE NOT NULL,
  sd DOUBLE NOT NULL,
  process_date DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB;

How would I tell dbWriteTable() that my frame's five columns correspond to mid, y, w, rate and sd in that order, and that the fields id and process_date will take the appropriate default values? Or do I need a temporary table, in memory, that has only the five columns, and use a stored procedure to move the data to its final home?

Thanks again,

Ted

On Wed, Oct 15, 2008 at 9:57 PM, Gabor Grothendieck wrote:

Put the data in an R data frame and use dbWriteTable() to write it to your MySQL database directly.
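On the push_back question: ordinary R vectors can be extended with c(), although preallocating is usually preferable when the final length is known. The sketch below compares three common idioms using made-up values (squares of 1 to 5); it is an illustration, not a recommendation specific to this data.

```r
n <- 5

# Growing a vector one element at a time, the closest analogue to push_back.
grown <- numeric(0)
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating and filling by index; usually preferable when n is known.
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

# Letting sapply() collect the results without an explicit loop.
applied <- sapply(seq_len(n), function(i) i^2)

print(grown)
```

All three produce the same vector. Growing with c() copies the vector on each iteration, which matters little for 60 values but can become slow for very long vectors.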
Re: [R] Argh! Trouble using string data read from a file
Thank you Prof. Ripley. I appreciate this. Have a good day.

Ted

On Thu, Oct 16, 2008 at 12:20 AM, Prof Brian Ripley wrote:

On Wed, 15 Oct 2008, Ted Byers wrote:

So I tried it (Windows XP; R 2.7.2), and this is what I got by copying directly from the documentation and pasting into the Windows commandline window:

Your problem is the shell quoting: the Windows shell requires ". E.g.

C:\> d:/R/R-2.7.2/bin/Rscript -e "date()" -e "format(Sys.time(), \"%a %b %d %X %Y\")"
[1] "Thu Oct 16 05:16:46 2008"
[1] "Thu Oct 16 05:16:46 2008"

Other shells (e.g. bash, tcsh) do allow '', and indeed that is the preferred form there. See ?shQuote.
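The ?shQuote help page mentioned in the reply documents a base R function that produces the quoting each shell expects. The sketch below applies it to the expression from the Rscript example; it only builds and prints the command strings and does not run them.

```r
expr <- 'format(Sys.time(), "%a %b %d %X %Y")'

# Quoting suitable for the Windows command interpreter: wrap in double
# quotes and backslash-escape the embedded double quotes.
win <- shQuote(expr, type = "cmd")

# Quoting suitable for Bourne-style shells such as bash: single quotes.
unix <- shQuote(expr, type = "sh")

cat("Rscript -e", win, "\n")
cat("Rscript -e", unix, "\n")
```

The first printed line has the same shape as the command in the reply above, with the inner double quotes escaped; the second matches the single-quoted form given in the Rscript documentation.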
Re: [R] Two last questions: about output
Thanks Gabor, To be clear, would something like testframe$est[[i]] - fp$estimate be valid within my loop, as in (assuming I created testframe before the loop): for (i in 1:length(V4) ) { x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings=); y = x[,1]; fp = fitdistr(y,exponential); print(c(V1[[i]],V2[[i]],V3[[i]],fp$estimate,fp$sd)) testframe$est[[i]] - fp$estimate testframe$sd[[i]] - fp$sd } Thanks Ted On Thu, Oct 16, 2008 at 12:08 AM, Gabor Grothendieck [EMAIL PROTECTED] wrote: testframe$newvar - ...whatever... (or see ?transform for another way) adds a new column to the data frame. The table does not have to pre-exist in your MySQL database and you don't need a create statement; however, if the table does pre-exist the columns of your data frame and those of the database table should have the same names in the same order and use dbWriteTable(..., append = TRUE) On Wed, Oct 15, 2008 at 11:54 PM, Ted Byers [EMAIL PROTECTED] wrote: Thanks Gabor, I get how to make a frame using existing vectors. In my example, the following puts my first three columns into a frame (and displays it: testframe - data.frame(mid=V1,year=V2,week=V3) testframe mid year week 1 251 2008 18 2 251 2008 19 3 251 2008 20 4 251 2008 22 5 251 2008 23 6 251 2008 24 7 251 2008 25 I show the first of about 60 rows, and I am pleased that these values appear as integers. But what I don't see is how to add the fp$estimate,fp$sd values obtained from my analyses to vectors to form the last two columns in the data frame. Is there something like a vector type, analogous to the vector class std::vector from C++, that has a push_back function allowing a vector to grow as new values are generated? 
And suppose I have the following table in MySQL (ignoring for the moment keys and indeces): CREATE TABLE ( id INTEGER UNSIGNED NOT NULL auto_increment, mid INTEGER NOT NULL, y INTEGER NOT NULL, w INTEGER NOT NULL, rate DOUBLE NOT NULL, sd DOUBLE NOT NULL process_date DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ) ENGINE=InnoDB; How would I tell dbWriteTable() that my frame's five columns correspond to mid,y,w,rate and sd in that order, and that the fields id and process_date will take the appropriate default values? Or do I need a temporary table, in memory, that has only the five columns, and use a stored procedure to move the data to its final home? Thanks again, Ted On Wed, Oct 15, 2008 at 9:57 PM, Gabor Grothendieck [EMAIL PROTECTED] wrote: Put the data in an R data frame and use dbWriteTable() to write it to your MySQL database directly. On Wed, Oct 15, 2008 at 9:34 PM, Ted Byers [EMAIL PROTECTED] wrote: Here is my little scriptlet: optdata = read.csv(K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat, header = FALSE, na.strings=) attach(optdata) library(MASS) setwd(K:\\MerchantData\\RiskModel\\AutomatedRiskModel) for (i in 1:length(V4) ) { x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings=); y = x[,1]; fp = fitdistr(y,exponential); print(c(V1[[i]],V2[[i]],V3[[i]],fp$estimate,fp$sd)) } And here are the first few lines of output: rate rate 2.51e+02 2.008000e+03 1.80e+01 6.869301e-02 6.462095e-03 rate rate 2.51e+02 2.008000e+03 1.90e+01 5.958023e-02 4.491029e-03 rate rate 2.51e+02 2.008000e+03 2.00e+01 8.631714e-02 7.428996e-03 rate rate 2.51e+02 2.008000e+03 2.20e+01 1.261538e-01 1.137491e-02 rate rate 2.51e+02 2.008000e+03 2.30e+01 1.339523e-01 1.332875e-02 rate rate 2.51e+02 2.008000e+03 2.40e+01 8.916084e-02 1.248501e-02 There are only two things wrong, here. 1) the first three columns are integers, and are output variously as integers, floating point numbers and, as shown here, in scientific notation. 
2) this output isn't going to a file or to my DB. This second issue isn't much of a problem, as I think I know now how to deal with it. This output data is, in one sense, perfectly organized, and there is a table with a nearly identical structure (these five columns, plus one to hold the date on which the analysis is performed (and of course, therefore, it has a default value of the current timestamp - handled in MySQL). If I can get the data written to a CSV file, with the first three columns provided as integers, I can use the DB's bulk load utility to get the data into the DB, and this may be faster than having this scriptlet connecting directly to the DB to insert the data (unless the DBI has a function for a bulk load that helps here). Any idea how best
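The push_back question and the integer-formatting complaint can both be addressed by collecting the estimates in preallocated vectors and only then building the data frame. What follows is a minimal sketch, not the poster's actual setup: the simulated `samples` list stands in for the files named in V4, the `mid`, `year` and `week` vectors stand in for V1 to V3, and the output file is a temporary one.

```r
library(MASS)  # provides fitdistr()

set.seed(42)
# Hypothetical stand-ins for the samples that would be read from each file in V4
samples <- list(rexp(100, rate = 0.07), rexp(150, rate = 0.06), rexp(120, rate = 0.09))
mid  <- c(251L, 251L, 251L)     # stand-in for V1
year <- c(2008L, 2008L, 2008L)  # stand-in for V2
week <- c(18L, 19L, 20L)        # stand-in for V3

n    <- length(samples)
rate <- numeric(n)  # preallocate; R vectors can also grow, but this is tidier
sd   <- numeric(n)
for (i in seq_len(n)) {
  fp      <- fitdistr(samples[[i]], "exponential")
  rate[i] <- fp$estimate[["rate"]]
  sd[i]   <- fp$sd[["rate"]]
}

# Integer columns stay integer because they never pass through c() with doubles
results  <- data.frame(mid = mid, year = year, week = week, rate = rate, sd = sd)
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)
```

The scientific notation in the original output comes from print(c(...)), which coerces the integers and the estimates into one double vector; keeping them as separate data frame columns avoids that. The resulting frame is also what dbWriteTable(..., append = TRUE) expects, provided the column names are changed to match the table.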
[R] Getting frustrated with RMySQL
Getting the basic stuff to work is trivially simple. I can connect and, for example, get everything in any given table. What I have yet to find is how to deal with parameterized queries, or how to do a simple insert of a value that is not known at the time the script is written (I ultimately want to put my script into a scheduled task, so the analysis can be repeated on updated data either daily or weekly).

Using INSERT INTO myTable (a) VALUES (1) is simple enough, but what if I want to insert a sample number (using, e.g., WEEK(sample_date) as a sample identifier) along with the rate parameter estimated using fitdistr to fit an exponential distribution to a dataset, along with its sd? If I were using Perl or Java, I'd set up the query similar to INSERT INTO myTable (a,b,c) VALUES (?,?,?), and then use function calls to set each of the query parameters. I am having an awful time finding the corresponding functions in RMySQL.

And for the data, the simplest and most efficient way to get it is to use a statement like:

    SELECT a,b,c FROM myTable GROUP BY g_id, WEEK(sdate);

The data is in MySQL, and my analysis needs to be applied independently to each group obtained from a query like this. It appears I can't use a data frame, since none of the samples are of the same size (let's say the probability of the samples being the same size is indistinguishable from 0). Is it possible to put the result set from such a query into a list of vectors that I can iterate over, passing each vector to fitdistr in turn? If so, how?

I know I can get this using Perl (by getting each sample individually and writing it to a file, then having R read the file, do the analysis and write the output to another file, and then having Perl parse the output file to insert the parameter estimates I need into the appropriate table), but that seems inefficient. Is it possible to do all I need with R working directly with MySQL?
If so, can someone fill in the apparent gaps left in the RMySQL documentation? Thanks. Ted -- View this message in context: http://www.nabble.com/Getting-frustrated-with-RMySQL-tp19980592p19980592.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
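The "list of vectors" being asked for is what split() produces from a single long data frame, so unequal sample sizes are not an obstacle. The sketch below makes several assumptions: the data frame `df` is simulated in place of a fetched result set, the column names `g_id`, `wk` and `ndays` are illustrative, and the query is taken to return the raw rows with the grouping keys as ordinary columns rather than collapsing them with GROUP BY.

```r
library(MASS)  # provides fitdistr()

set.seed(1)
# Hypothetical stand-in for a fetched result set with one row per observation,
# e.g. from "SELECT g_id, WEEK(sdate) AS wk, ndays FROM myTable"
df <- data.frame(
  g_id  = rep(c(1L, 2L), times = c(50, 80)),
  wk    = rep(c(18L, 19L), times = c(50, 80)),
  ndays = c(rexp(50, rate = 0.07), rexp(80, rate = 0.06))
)

# One numeric vector per (g_id, week) combination; group sizes may differ
groups <- split(df$ndays, list(df$g_id, df$wk), drop = TRUE)

# Fit each group in turn and pull out the estimate and its standard error
fits  <- lapply(groups, fitdistr, densfun = "exponential")
rates <- sapply(fits, function(f) f$estimate[["rate"]])
sds   <- sapply(fits, function(f) f$sd[["rate"]])
```

The names of `groups` (here "1.18" and "2.19") carry the grouping keys, so the estimates can be matched back to their group before being written to the database.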
Re: [R] Getting frustrated with RMySQL
Thanks Jeffrey and Barry, I like the humour. I didn't know about xkcd.com, but the humour on it is familiar. I saw little Bobby Tables what seems like eons ago, when I first started cgi programming. Anyway, I recognized the risk of an injection attack with this use of sprint, but in this case, there is no risk because all the data used is coming from previously sanitized data in our DB, and the parameters in this case will invariably be integers. Thanks again Ted Jeffrey Horner wrote: Barry Rowlingson wrote on 10/14/2008 04:40 PM: 2008/10/14 Jeffrey Horner [EMAIL PROTECTED]: I've found the best way to parameterize is using R's sprintf function. For instance, the following query not only parameterizes the variable position, but also the table name: fields - dbGetQuery(con,sprintf(select field,elem_label from %s_meta where field='%s',inp$pnid,inp$field)) And thus a million web SQL injection exploits were born... Even if you do have control over the parameters to the query, you still have to worry about quotes or other nasty escape characters in your string ending up in the SQL. I hope little Bobby Tables isn't a subject in your analysis: Thank goodness I don't do analysis, as I haven't the schooling. Barry, I'm ashamed of you! I was hoping you'd at least offer an alternative. http://xkcd.com/327/ Okay, you are pardoned: I LOVE xkcd! Especially this one: http://xkcd.com/349/ Best, Jeff -- http://biostat.mc.vanderbilt.edu/JeffreyHorner __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- View this message in context: http://www.nabble.com/Getting-frustrated-with-RMySQL-tp19980592p19983073.html Sent from the R help mailing list archive at Nabble.com. 
Re: [R] Getting frustrated with RMySQL
That is neat Gabor. Thanks, Ted Gabor Grothendieck wrote: The gsubfn package can do quasi perl-style interpolation by prefacing any function call with fn$. library(gsubfn) x - 3 fn$dbGetQuery(con, select * from myTable where myColumnA = $x and MyColumnB = `2*x` ) See http://gsubfn.googlecode.com On Tue, Oct 14, 2008 at 5:32 PM, Jeffrey Horner [EMAIL PROTECTED] wrote: Ted Byers wrote on 10/14/2008 02:33 PM: Getting the basic stuff to work is trivially simple. I can connect, and, for example, get everything in any given table. What I have yet to find is how to deal with parameterized queries or how to do a simple insert (but not of a value known at the time the script is written - I ultimately want to put my script into a scheduled task, so the analysis can be repeated on updated data either daily or weekly). Using INSERT INTO myTable (a) VALUES (1) is simple enough, but what if I want to insert a sample number (using, e.g. WEEK(sample_date) as a sample identifier) along with the rate parameter estimated using fitdistr to fit an exponential distribution to a dataset, along with its sd? If I were using Perl or Java, I'd set up the query similar to INSERT INTO myTable (a,b,c) VALUES (?,?,?), and then use function calls to set each of the query parameters. I am having an aweful time finding the corresponding functions in RMySQL. I've found the best way to parameterize is using R's sprintf function. For instance, the following query not only parameterizes the variable position, but also the table name: fields - dbGetQuery(con,sprintf(select field,elem_label from %s_meta where field='%s',inp$pnid,inp$field)) Best, Jeff And for the data, the simplest, and most efficient, way to get the data is to use a statement like: SELECT a,b,c FROM myTable GROUP BY g_id, WEEK(sdate); The data is in MySQL, and my analysis needs to be applied independantly to each group obtained from a query like this. 
It appears I can't use a data frame since none of the samples are of the same size (lets say the probability of the samples being the same size in indistinguishable from 0). Is it possible to put the resultset from such a query into a list of vectors that I can iterate over, passing each vector to fitdistr in turn? If so, how? I know I can get this using Perl (by getting each sample individually and writing it to a file, then having R read the file, do the analysis and write the output to another file, and then have Perl parse the output file to insert the parameter estimates I need into the appropriate table), but that seems inefficient. Is it possible to do all I need with R working directly with MySQL? If so, can someone fill in the apparent gaps left in the RMySQL documentation? Thanks. Ted -- http://biostat.mc.vanderbilt.edu/JeffreyHorner __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- View this message in context: http://www.nabble.com/Getting-frustrated-with-RMySQL-tp19980592p19983099.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
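For the INSERT side of the sprintf approach discussed in this thread, a small helper can coerce and check the values before they reach the SQL string, which is the "parameters will invariably be integers" argument made explicit. This is only a sketch: `build_insert` is a hypothetical helper, and the table and column names are the placeholders from the original question.

```r
# Hypothetical helper: builds one INSERT statement from numeric values only
build_insert <- function(week, rate, sd, table = "myTable") {
  week <- as.integer(week)  # anything that is not a whole number becomes NA
  stopifnot(!is.na(week), is.finite(rate), is.finite(sd))
  sprintf("INSERT INTO %s (a, b, c) VALUES (%d, %.10g, %.10g)",
          table, week, rate, sd)
}

sql <- build_insert(18, 0.06869301, 0.006462095)
sql
# The string would then be passed to dbSendQuery(con, sql); not run here.
```

Because only numbers produced by as.integer() and finite doubles are ever formatted, a stray string such as "18; DROP TABLE x" stops the script instead of reaching the database. The table name is still pasted in verbatim, so it should come from the script itself and never from external input.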
[R] Applying an R script to data within MySQL? How to?
I am trying something I haven't attempted before, and the available documentation doesn't quite answer my questions (at least in a way I can understand). My usual course of action would be to extract my data from my DB, do whatever manipulation is necessary, either manually or using a C++ program, and then import the data into R. Now I need to try to do it all within R+RMySQL+MySQL.

I just managed to connect to MySQL and retrieve data using RMySQL as follows:

    library(DBI)
    library(RMySQL)
    MySQL(max.con = 16, fetch.default.rec = 500, force.reload = F)
    <MySQLDriver:(3800)>
    m <- dbDriver("MySQL")
    con <- dbConnect(m, user = "rejbyers", password = "jesakos", host = "localhost", dbname = "merchants2")
    rs <- dbSendQuery(con, "select * from merchants")
    df <- fetch(rs, n = 150)
    df

And of course, that last statement is followed by the entire contents of merchants.

Now, I have a script like the following:

    refdata18 = read.csv("K:\\MerchantData\\RiskModel\\ndays18.csv", na.strings = "")
    x1 = refdata18[,1]
    library(MASS)
    ex1 = fitdistr(x1, "exponential")
    str(ex1)

The contents of ndaysXX.csv represent records where one of the date values is in week XX of the current year. We don't yet have data spanning multiple years, and will have to modify the SQL that gets the data accordingly. At present, my SQL statement groups records by WEEK of the year, and then I manually separate weeks in a CSV file outside the DB. Suppose I make a query like:

    SELECT ndays FROM xxx GROUP BY WEEK(tdate);

There is no a priori way of knowing just how many weeks of data there are. My reason for asking is that I see information in the documentation about dbApply(RMySQL), which says: "Applies R functions to groups of remote DBMS rows without bringing an entire result set all at once. The result set is expected to be sorted by the grouping field." There is an example, but the example doesn't make much sense (the query used, for example, does not contain a GROUP BY clause).

I can easily set up a table that could be used to manage the output I need (primarily the rate value estimated for each week, and the SD of the estimate), but at present I am at a loss as to how to proceed to set this up. Can some kind soul out there give me rather pedantic instructions on how to use RMySQL to apply, in my case, fitdistr independently to each group of values returned by my simplistic SQL query above, and insert the rate and sd into another table? I know I can handle all this using a Perl script to create a suite of temporary files and process them one by one, but I have also been advised to try to use R instead of Perl for this kind of task.

A slightly related question is this: assuming I can get this all working from within R, how would I make it a scheduled task on the one hand or, on the other hand, run it on demand from an event on a web page (which at present is made using a combination of PHP, Apache's httpd server and MySQL, if that matters)? Of course, if I can make such an R script (or even store it as a function), there should be no memory from one instance to another, because the same analysis would have to be done on different users' data.

Thanks
Ted
--
View this message in context: http://www.nabble.com/Applying-an-R-script-to-data-within-MySQL---How-to--tp19888407p19888407.html
[R] What distribution is related to hypergeometric?
I have been reading, in various sources, that a Poisson distribution is related to the binomial, extending the idea to include numbers of events in a given period of time. In my case, the hypergeometric distribution seems more appropriate, but I need a temporal dimension to the distribution.

I have weekly samples of two kinds of events: call them A and B. I have a count of A events. These change dramatically from one week to the next. I also have weekly counts of B events that I can relate to A events. Some fraction 'lambda' (between 1 and 1) of A events will result in B events some time in the future (but also sometimes in the same week that the related A event occurred). The B event related to a given A event can occur as much as ten weeks after the A event. B events cannot occur without a prior A event, and well over half of the A events will never produce a B event. Also, we know that a given A event cannot produce more than one B event. Hence hypergeometric is much more appropriate than binomial, and thus my need for the distribution that has the same relation to the hypergeometric that the Poisson has to the binomial. Since hypergeometric is related to binomial, would Poisson also be related to hypergeometric?

My data is best expressed as a fraction: the number of B events in a given week divided by the number of A events producing the B events. I.e., if there are 500 A events in week n, the data would be the number of related B events in week m (m >= n) divided by 500. The first table I get from the DB has records containing an ordered pair: week number, fraction. E.g.

    0,0.2
    1,0.3
    2,0.25
    3,0.2
    ...

The above is dummy data, but the pattern I see in the data is that the number of B events in week 0 is less than the number of B events in week 1, and from then on the number of B events declines exponentially (as you'd expect from what could be described as a decay process, altered to reflect the fact that over half of the original A events will never produce B events).

Of all the distributions I tried on this data, exponential and Poisson produced the best fits, with very little to choose between them. Always, the cumulative fraction of A events that have produced B events approaches an asymptote between 0.25 and 0.45. Never higher, but now it looks like the asymptotes are getting smaller (the behaviour of the system is changing).

In a sense, this breaks down into two questions:

1) What distribution should I try to fit to my data?
2) How do I present my data to the functions that will try to fit the distribution to this data?

The reason for the second is that, while I have examined lots of functions (fBasics, MASS, etc.) that will try to fit a distribution to data, they all seem to expect a 1D vector of data, and none of them say anything about the form of the data, or what to do if you already have an empirical (cumulative) distribution. To try out the functions that fit distributions, I created a dummy vector where the initial sample size was 1000, and the number of values equal to a given week number would be 1000 * the fraction of A events that produced B events. E.g., using the sample numbers above, there'd be 200 '0's, 300 '1's, 250 '2's, etc.

Thanks
Ted
--
View this message in context: http://www.nabble.com/What-distribution-is-related-to-hypergeometric--tp19671054p19671054.html
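The dummy-vector construction described in this message (200 '0's, 300 '1's, 250 '2's, and so on) can be written with rep(). The sketch uses only the dummy fractions quoted in the post and the notional sample size of 1000; the variable names are illustrative.

```r
weeks    <- 0:3
fraction <- c(0.2, 0.3, 0.25, 0.2)  # dummy fractions from the post
n_a      <- 1000                    # notional number of A events

# Number of B events observed at each lag, then one vector entry per B event
counts <- round(n_a * fraction)
x      <- rep(weeks, times = counts)

table(x)

# The same summary is available without expanding the counts at all
mean_lag <- sum(weeks * counts) / sum(counts)
```

A vector built this way is what functions such as fitdistr() expect. It only contains A events that did produce a B event, so it describes the delay distribution; the fraction of A events that never produce a B event has to be estimated separately.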
Re: [R] What distribution is related to hypergeometric?
> I have weekly samples of two kinds of events: call them A and B. I have a count of A events. These change dramatically from one week to the next. I also have weekly counts of B events that I can relate to A events. Some fraction 'lambda' (between 1 and 1) of A events will result in B events some time in

OOPS, that OUGHT to have been between 0 and 1.

Ted
--
View this message in context: http://www.nabble.com/What-distribution-is-related-to-hypergeometric--tp19671054p19671301.html
[R] Please help me interpret these results (fitting distributions to real data)
I just thought of a useful metaphor for the problem I face. I am dealing with a problem in business finance, with two kinds of related events. However, imagine you have a known amount of carbon (so many kilograms), but you do not know what fraction is C14 (and thus radioactive). Only the C14 will give decay events (and once that event has occurred, the atom that decayed will never decay again). C12 will never decay. What you want to know is a) what is the ratio of C12 to C14 at time 0, and b) how many decay events will happen between time x and time y, or how many decay events will happen after time z. That integral is, IIRC, quite simple. The data you get from your equipment will be the number of decay events in time period n (which could be a specific week or a specific day). How would you get this data into R so that you can use, say, fitdistr(MASS) to estimate the decay rate, and then proceed to answer the questions of interest?

Anyway, in my early tests (before I figured out which distribution is most appropriate in this case), I got the following results (this is for one week's data, but other weeks' results are similar).

== curious results ==

    ex15 = fitdistr(x15, "exponential")
    str(ex15)
    List of 4
     $ estimate: Named num 0.0653
      ..- attr(*, "names")= chr "rate"
     $ sd      : Named num 0.00356
      ..- attr(*, "names")= chr "rate"
     $ n       : int 337
     $ loglik  : num -1256
     - attr(*, "class")= chr "fitdistr"

    ge15 = fitdistr(x15, "geometric")
    str(ge15)
    List of 4
     $ estimate: Named num 0.0613
      ..- attr(*, "names")= chr "prob"
     $ sd      : Named num 0.00324
      ..- attr(*, "names")= chr "prob"
     $ n       : int 337
     $ loglik  : num -1257
     - attr(*, "class")= chr "fitdistr"

    po15 = fitdistr(x15, "poisson")
    str(po15)
    List of 4
     $ estimate: Named num 15.3
      ..- attr(*, "names")= chr "lambda"
     $ sd      : Named num 0.213
      ..- attr(*, "names")= chr "lambda"
     $ n       : int 337
     $ loglik  : num -2721
     - attr(*, "class")= chr "fitdistr"

    nb15 = fitdistr(x15, "negative binomial")
    Warning messages:
    1: In dnbinom(x, size, prob, log) : NaNs produced
    2: In dnbinom(x, size, prob, log) : NaNs produced
    3: In dnbinom(x, size, prob, log) : NaNs produced
    str(nb15)
    List of 4
     $ estimate: Named num [1:2] 0.973 15.309
      ..- attr(*, "names")= chr [1:2] "size" "mu"
     $ sd      : Named num [1:2] 0.0786 0.8719
      ..- attr(*, "names")= chr [1:2] "size" "mu"
     $ loglik  : num -1267
     $ n       : int 337
     - attr(*, "class")= chr "fitdistr"

    AIC(ex15)
    [1] 2514.952
    AIC(ge15)
    [1] 2516.273
    AIC(po15)
    [1] 5444.62
    AIC(nb15)
    [1] 2538.385

== end curious results ==

Notice that the AIC values for the exponential and geometric distributions are almost identical, and that for the negative binomial is not much different. This now makes some sense, the geometric being a discrete equivalent of the exponential, as well as being a special case of the negative binomial. Right? With such relationships among them, it would not be surprising to see them give similar values of AIC. Right?

Thanks
Ted
--
View this message in context: http://www.nabble.com/Please-help-me-interpret-these-results-%28fitting-distributions-to-real-data%29-tp19678782p19678782.html
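The figures reported above can be cross-checked by hand, which also shows why the exponential and geometric fits track each other so closely: both estimates are simple functions of the sample mean. The sketch below uses only numbers quoted in the output; taking the negative binomial's mu (15.309) as the sample mean is an assumption, although it agrees with the reported Poisson lambda of 15.3.

```r
n      <- 337
mean_x <- 15.309  # assumed sample mean, taken from the reported mu / lambda

rate_hat <- 1 / mean_x          # exponential MLE
prob_hat <- 1 / (1 + mean_x)    # geometric MLE (support starting at 0)
sd_rate  <- rate_hat / sqrt(n)  # asymptotic standard error of the rate

# AIC = 2k - 2 * logLik, so the log-likelihoods can be recovered from the AICs
aic <- c(exponential = 2514.952, geometric = 2516.273,
         poisson = 5444.62, negbin = 2538.385)
k      <- c(1, 1, 1, 2)  # number of estimated parameters per model
loglik <- -(aic - 2 * k) / 2
round(loglik)
```

The recovered values match the rounded loglik entries printed by str(). The negative binomial carries the penalty for a second parameter, and its estimated size of 0.973 is close to 1, the value at which it reduces to the geometric. The much larger Poisson AIC is what one would expect if the data are strongly overdispersed, since a Poisson with mean 15.3 forces the variance to be 15.3 as well.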
Re: [R] Statistical question re assessing fit of distribution functions.
Thanks Timur While assessing whether or not the best option would be a normal distribution (it won't be, the data in this case LOOKS more poisson, or if I explude the first week of results, a negative exponential; and in my other case, cauchy is more likely), I really need a test that can be applied regardless of the distribution to see which distribution fits best. Using log-likelihood, there doesn't seem to be much to choose between exponential and poisson (the log-likelihhod for them being almost the same, regardless of the sample even tough the parameters are very different from one sample to the next - I don't understand why yet), and the others I have tried are MUCH worse, but I'm not done yet. Are you aware of functions that allow estimation of all the parameters of a non-central distribution? I ask because a problem I'll be working on in a few weeks will involve the kind of skew produced by a non-central distribution (among others). I see some functions allow you to work with skewed distributions (e.g. [dpqr]stable the skewed stable distribution ) but I have not yet found functions that alow one to estimate their parameters from real data. Thanks, Ted Timur Shtatland wrote: If one of the goals is the normality test, then there may be better alternatives to the Kolmogorov-Smirnov test. See an explanation on: http://graphpad.com/FAQ/viewfaq.cfm?faq=959 The R implementation: ?shapiro.test A casual search also turned this up: http://tolstoy.newcastle.edu.au/R/help/04/09/3201.html http://tolstoy.newcastle.edu.au/R/help/04/08/3121.html http://www.karlin.mff.cuni.cz/~pawlas/2008/MAI061/dagost.R Best, Timur -- Timur Shtatland, Ph.D. Senior Bioinformatics Scientist Agencourt Bioscience Corporation - A Beckman Coulter Company 500 Cummings Center, Suite 2450 Beverly, MA 01915 www.agencourt.com On Mon, Sep 22, 2008 at 12:26 PM, Ted Byers [EMAIL PROTECTED] wrote: I am in a situation where I have to fit a distrution, such as cauchy or normal, to an empirical dataset. 
Well and good, that is easy. But I wanted to assess just how good the fit is, using ks.test. I am concerned about the following note in the docs (about the example provided): Note that the distribution theory is not valid here as we have estimated the parameters of the normal distribution from the same sample This implies I should not use ks.test(x,pnorm,mean =1.187, sd =0.917), where the numbers shown are estimated from 'x'. If this is so, how do I get a correct test? I know I can not use different samples because of just how different the parameters are from one sample to the next, so using parameters estimated from the sample from week one to define the distribution function for ks.test will give a poor fit for the data from week two. And the sample size is small enough that I would not have confidence in the parameters estimated from a portion of a samlpe to fit against the remainder of the sample. Thanks Ted -- View this message in context: http://www.nabble.com/Statistical-question-re-assessing-fit-of-distribution-functions.-tp19611539p19611539.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- View this message in context: http://www.nabble.com/Statistical-question-re-assessing-fit-of-distribution-functions.-tp19611539p19629108.html Sent from the R help mailing list archive at Nabble.com. 
[R] Trouble understanding the behaviour of stableFit(fBasics)
Can anyone explain such different output:

    stableFit(s, alpha = 1.75, beta = 0, gamma = 1, delta = 0,
    +     type = c("q", "mle"), doplot = TRUE, trace = FALSE, title = NULL,
    +     description = NULL)

    Title:
     Stable Parameter Estimation
    Call:
     .qStableFit(x = x, doplot = doplot, title = title, description = description)
    Model:
     Student-t Distribution
    Estimated Parameter(s):
         alpha       beta      gamma      delta
         1.534      0.275  0.3211991 -0.9922306
    Description:
     Tue Sep 23 22:18:44 2008 by user: Ted

    refdata18 = read.csv("C:\\MerchantData\\RiskModel\\Capture.Week.18.csv", na.strings = "")
    stableFit(refdata18[,1], alpha = 1.75, beta = 0, gamma = 1, delta = 0,
    +     type = c("q", "mle"), doplot = TRUE, trace = FALSE, title = NULL,
    +     description = NULL)

    Title:
     Stable Parameter Estimation
    Call:
     .qStableFit(x = x, doplot = doplot, title = title, description = description)
    Model:
     Student-t Distribution
    Estimated Parameter(s):
    alpha  beta gamma delta
       NA    NA    NA    NA
    Description:
     Tue Sep 23 22:20:23 2008 by user: Ted

I am just playing with it right now, trying to understand how to call it, so first I passed the s vector from the example. I don't care about the result except to know that stableFit accepted the input and obtained an estimate for the parameters. Then I tried my data (a vector of integers, with a distribution that looks similar to Poisson, but for which exponential and geometric give better fits). What I find puzzling is that I get no error messages complaining about one property or another of my data, to explain why there are no parameter estimates.

The data I WILL be applying this to comes from the financial markets, and will be reals or floating point numbers that in some cases will be best modelled by a normal distribution, while in most cases the distribution will be closer to Cauchy. DistributionFits(fBasics) makes no explicit mention of Cauchy, but IIRC Cauchy is a special case of a stable distribution, one of a family. Are these the L-stable distributions Mandelbrot discussed, or something else? Correct me if my memory has failed me sooner than anticipated ;-) A URL for a website discussing these in some detail would be handy, as my stats texts, dated as they are and focussed more on applied biometrics, don't talk about these.

What do I look at if this function just gives me a bunch of 'NA's instead of parameter estimates? And, given the structure of the documentation, it is not clear whether I can get an estimate of skewness for all the distributions, or for all except the t and normal distributions, if I am using DistributionFits.

Thanks
Ted
--
View this message in context: http://www.nabble.com/Trouble-understanding-the-behaviour-of-stableFit%28fBasics%29-tp19640972p19640972.html
[R] Re lative novice: Working with fitdistr(MASS): 3 questions
OK, I am now at the point where I can use fitdistr to obtain a fit of one of the standard distributions to my data. It is quite remarkable how different the parameters are for different samples, though from the same system. Clearly the system itself is not stationary.

Anyway, question 1: I require a visual perspective of the fit I get. I can use hist.scott to get a histogram (and just have to figure out how to get finer granularity from it: my samples are taken weekly, but the histogram bars cover two weeks of data, and the most interesting changes happen in the first three to four weeks, after which things slow down tremendously), but how would I overlay a plot of the best distribution I get from fitdistr over it?

Second question: I don't see anything in the documentation for fitdistr that says anything about using the distribution obtained to integrate the distribution over some range of values. I get weekly samples, and for each sample I get a certain number of events each week for about three months. I need to be able to use the distribution to estimate the number of such events next week or the week after, and how long it will be until the probability of such an event is so low that no more of them are likely to be observed from that sample ever. What package or functions should I be looking at here to get this done?

Third question: I see nothing in the docs about non-central distributions. The distribution most likely to fit is Cauchy, but we know that there is skew that depends on the magnitude: large positive deviates are more common than large negative deviates, but extremely large positive deviates are less common than extremely large negative deviates. What we don't know is how significant such skewness is for the overall distribution. How can I assess this, or can I assess this, using fitdistr (or some other function I haven't found yet)?

Thanks
Ted
--
View this message in context: http://www.nabble.com/Relative-novice%3A-Working-with-fitdistr%28MASS%29%3A-3-questions-tp19610812p19610812.html
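For the first two questions, base R graphics and the p/q distribution functions may already be enough. The sketch below is one possible approach rather than a recommended recipe: the `weeks` vector is simulated as a stand-in for a real sample, the exponential is used only because it is the fit discussed elsewhere in these threads, and the one-week bins are set explicitly through `breaks`.

```r
library(MASS)  # provides fitdistr()

set.seed(7)
weeks <- rgeom(300, prob = 0.25)  # hypothetical stand-in for one weekly sample

fit  <- fitdistr(weeks, "exponential")
rate <- fit$estimate[["rate"]]

# Question 1: one bar per week (density scale), with the fitted curve on top
hist(weeks, breaks = seq(-0.5, max(weeks) + 0.5, by = 1), freq = FALSE,
     main = "Weekly data with fitted exponential", xlab = "week")
curve(dexp(x, rate = rate), from = 0, to = max(weeks), add = TRUE)

# Question 2: probability mass between two times, and a practical cutoff
p_week_3_to_5 <- pexp(5, rate) - pexp(3, rate)  # integral of the density on [3, 5]
cutoff        <- qexp(0.99, rate)               # time by which 99% of events occur
```

Multiplying `p_week_3_to_5` by the expected total number of events gives an expected count for that window. Because the data are whole weeks and the exponential is continuous, the overlay is only approximate near zero; a geometric fit plotted as points with dgeom() would be the discrete counterpart.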
[R] Statistical question re assessing fit of distribution functions.
I am in a situation where I have to fit a distribution, such as Cauchy or normal, to an empirical dataset. Well and good, that is easy. But I wanted to assess just how good the fit is, using ks.test.

I am concerned about the following note in the docs (about the example provided): "Note that the distribution theory is not valid here as we have estimated the parameters of the normal distribution from the same sample". This implies I should not use ks.test(x, "pnorm", mean = 1.187, sd = 0.917), where the numbers shown are estimated from 'x'. If this is so, how do I get a correct test?

I know I cannot use different samples, because of just how different the parameters are from one sample to the next: using parameters estimated from the sample from week one to define the distribution function for ks.test will give a poor fit for the data from week two. And the sample size is small enough that I would not have confidence in parameters estimated from a portion of a sample to fit against the remainder of the sample.

Thanks
Ted
--
View this message in context: http://www.nabble.com/Statistical-question-re-assessing-fit-of-distribution-functions.-tp19611539p19611539.html
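One commonly used way around the problem raised here is a parametric bootstrap: simulate from the fitted distribution, re-estimate the parameters on every simulated sample, and compare the observed KS statistic with the simulated ones. This is the idea behind the Lilliefors correction for the normal case. The sketch below assumes a normal fit and a simulated sample `x`; the number of replicates `B` is arbitrary.

```r
set.seed(123)
x <- rnorm(40, mean = 1.2, sd = 0.9)  # hypothetical sample standing in for real data

# KS statistic with the parameters estimated from the same sample it is tested on
ks_stat <- function(v) {
  unname(ks.test(v, "pnorm", mean = mean(v), sd = sd(v))$statistic)
}

d_obs <- ks_stat(x)

# Null distribution of the statistic when parameters are re-estimated each time
B     <- 500
d_sim <- replicate(B, ks_stat(rnorm(length(x), mean = mean(x), sd = sd(x))))

# Bootstrap p-value: how often a simulated fit looks at least as bad as ours
p_boot <- mean(d_sim >= d_obs)
```

The same scheme applies to other families by swapping the estimator and the random generator (for example fitdistr() with rcauchy() and "pcauchy"), at the cost of refitting on every replicate.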
Re: [R] Why isn't R recognising integers as numbers?
Thanks Jim,

Alas, it wasn't this. Here is the output from both of your suggestions:

    refdata18 = read.csv("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv", header = TRUE, na.strings = "")
    str(refdata18)
    'data.frame': 341 obs. of 1 variable:
     $ X0: int 0 0 0 0 0 0 0 0 0 0 ...

    scan("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv", what = 0L)
    Read 342 items
    [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    [26] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    [51] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    [76] 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [101] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [126] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [151] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
    [201] 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4
    [226] 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6
    [251] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7
    [276] 7 7 7 8 8 8 8 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10
    [301] 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
    [326] 12 12 12 18 18 18 18 18 18 18 18 18 18 18 18 18 18

Thanks anyway.

Ted

jholtman wrote:

My best guess is that they are not integers. Do 'str' on your object and it probably says they are 'factors'. This is probably due to some of your data being non-numeric. Try using 'colClasses' on read.csv to specify what the column should contain. Also try scan after skipping the first record if it is a header:

    scan("", what = 0L)  # bad input after specifying integer
    1: 1 2 3 4
    5: 1 v
    5: Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'an integer', got 'v'

    scan("", what = 0L)  # good input
    1: 1
    2: 2
    3: 3
    4:
    Read 3 items
    [1] 1 2 3

On Sun, Sep 21, 2008 at 9:01 PM, Ted Byers [EMAIL PROTECTED] wrote:

I have a number of files containing anywhere from a few dozen to a few thousand integers, one per record. The statement

    refdata18 = read.csv("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv", header = TRUE, na.strings = "")

works fine, and if I type refdata18, I get the integers displayed, one value per record (along with a record number). However, when I try fitdistr(refdata18, "negative binomial"), or hist.scott(refdata18, prob = TRUE), I get an error:

    Error in fitdistr(refdata18, "negative binomial") : 'x' must be a non-empty numeric vector

or

    Error in hist.default(x, nclass.scott(x), prob = prob, xlab = xlab, ...) : 'x' must be numeric

How can it not recognise integers as numbers?

Thanks

Ted

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

-- View this message in context: http://www.nabble.com/Why-isn%27t-R-recognising-integers-as-numbers--tp19600308p19600695.html
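One detail in the output above is worth a second look: str() reports 341 observations in a column named X0, while scan() reads 342 items. That pattern is what one would expect if header = TRUE consumed the first data value (a 0) as the column name. The sketch below reproduces the effect on made-up inline data and also shows the colClasses suggestion; the column name `count` and the sample values are assumptions for illustration only.

```r
# Made-up stand-in for the real file: one integer per record, no header line
csv_text <- "0\n0\n1\n2\n18\n"

# header = TRUE turns the first value into a column name ("X0") and drops a row
with_header <- read.csv(text = csv_text, header = TRUE)
names(with_header)
nrow(with_header)

# header = FALSE keeps every record; colClasses states what the column must hold
no_header <- read.csv(text = csv_text, header = FALSE,
                      colClasses = "integer", col.names = "count")
nrow(no_header)
str(no_header)

# With colClasses = "integer", non-numeric input fails loudly instead of
# silently producing a factor or character column
bad <- tryCatch(
  read.csv(text = "1\nv\n", header = FALSE, colClasses = "integer"),
  error = function(e) conditionMessage(e)
)
bad
```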
Re: [R] Why isn't R recognising integers as numbers?
Thanks Marc,

That was it. For the last 30 years, I'd write my own code, in FORTRAN, C++, or even Java, to do whatever statistical analysis I needed. When at the office, sometimes I could use SAS, but that hasn't been an option for me in years. This is the first time I have had to load real data into R (instead of generating random data to use while playing with some of the stats functions, or manually typing dummy data).

I take it, then, that the result of loading data is a data frame, and not just a matrix or array. Using something like refdata18[, 1] feels rather alien, but I'm sure I'll quickly get used to it. I'd seen it before in the R docs, but it didn't register that I had to use it to get the functions of most interest to me to recognise my data as a vector of numbers, given I'd provided only a vector of integers as input.

Thanks

Ted

Marc Schwartz wrote:

on 09/21/2008 08:01 PM Ted Byers wrote:

I have a number of files containing anywhere from a few dozen to a few thousand integers, one per record. The statement

    refdata18 = read.csv("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv", header = TRUE, na.strings = "")

works fine, and if I type refdata18, I get the integers displayed, one value per record (along with a record number). However, when I try fitdistr(refdata18, "negative binomial"), or hist.scott(refdata18, prob = TRUE), I get an error:

    Error in fitdistr(refdata18, "negative binomial") : 'x' must be a non-empty numeric vector

or

    Error in hist.default(x, nclass.scott(x), prob = prob, xlab = xlab, ...) : 'x' must be numeric

How can it not recognise integers as numbers?

Thanks

Ted

'refdata18' is a data frame and the two functions are expecting a numeric vector. If you use:

    fitdistr(refdata18[, 1], "negative binomial")

or

    hist(refdata18[, 1])

you should get a suitable result, presuming that the first column in the data frame is a numeric vector.

Use:

    str(refdata18)

to get a sense for the structure of the data frame, including the column names, which you could then use instead of the above index-based syntax.

HTH,

Marc Schwartz

-- View this message in context: http://www.nabble.com/Why-isn%27t-R-recognising-integers-as-numbers--tp19600308p19600803.html
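Marc's point can be checked directly: a data frame is a list of columns, so is.numeric() is FALSE for the frame even when its only column holds integers. A minimal sketch follows, using simulated counts in place of the real file (the data and the negative binomial parameters are assumptions) and fitdistr from the MASS package:

```r
library(MASS)  # provides fitdistr

set.seed(42)
# Simulated counts standing in for the contents of Capture.Week.18.csv
refdata18 <- data.frame(X0 = rnbinom(300, size = 1.5, mu = 4))

is.numeric(refdata18)        # FALSE: the data frame itself is not a numeric vector
is.numeric(refdata18[, 1])   # TRUE: the first column is

# Three equivalent ways to pull the column out as a plain vector
v1 <- refdata18[, 1]
v2 <- refdata18$X0
v3 <- refdata18[["X0"]]
identical(v1, v2) && identical(v2, v3)

# The fitting function is then given the vector, not the frame
fit <- fitdistr(refdata18$X0, "negative binomial")
fit$estimate
```

Extracting by name ($X0 or [["X0"]]) tends to be more robust than positional indexing once a file has more than one column.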
[R] Novice question about getting data into R
I found it easy to use R when typing data manually into it. Now I need to read data from a file, and I get the following errors:

    refdata = read.table("K:\\MerchantData\\RiskModel\\refund_distribution.csv", header = TRUE)
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 42 elements

    refdata = read.table("K:\\MerchantData\\RiskModel\\refund_distribution.csv")
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 2 did not have 42 elements

(I'd tried the first version above because the first record has column names.)

First, I don't know why R expects 42 elements in a record. There is one column for a time variable (weeks since a given week of samples were taken) and one for each week of sampling in the data file (Week 18 through Week 37 inclusive). And there are only 19 rows. The samples represented by the columns are independent, and the numbers in the columns are the fraction of events sampled that result in an event of another kind in the week since the sample was taken. The samples are not the same size, and starting with week 20, the number of values progressively gets smaller since there have been fewer than 37 weeks since the samples were taken.

I can show you the contents of the data file if you wish. It is an unremarkable csv file, with the strings used for column names enclosed in double quotes.

I don't have to manually separate the samples into their own files, do I? I was hoping to write a function that estimates the density function that best fits each sample individually, and then iterate over the columns, applying that function to each in turn. What is the best way to handle this?

Thanks

Ted

-- View this message in context: http://www.nabble.com/Novice-question-about-getting-data-into-R-tp19576065p19576065.html
Re: [R] Novice question about getting data into R
Thanks one and all.

Actually, I used OpenOffice's spreadsheet to create the csv file, but I have been using it long enough to know to specify how I wanted it, and sometimes, when that proves annoying, I'll use Perl to finesse it the way I want it. It seems my principal error was to assume that R would ignore the character strings within the double quotes and determine fields based on the commas. Silvia's remarks about empty cells and blanks in the middle of column names were right on the mark.

Tom, I appreciate the caveats you mention. I am aware of the complications of i18n, but they don't affect me much, as my stuff is run exclusively in Canada (pretty much the same norms as the US). In a sense they don't affect me because I have manipulated data around such issues using Perl in order to satisfy the peculiarities of the software used on one project or another. I deal with it almost as a matter of course, as long as I already know the peculiarities of the software I am working with. I have plenty of experience moving data between spreadsheets, RDBMS such as MS SQL, PostgreSQL, MySQL, and XML files, and have had to resort to unusual delimiters in the past because of peculiarities in the data feed.

While I have tonnes of experience developing software (C++, Java, FORTRAN, Perl), I only started playing with R a few months ago, and this is the first time I have had to import real data into it. While the tutorials I found were useful, it seems the key tidbits of information I need are scattered through the documentation, and I am finding it challenging to learn the peculiarities of R.

Thanks again one and all.

Ted

Tom Backer Johnsen wrote:

Silvia Lomascolo wrote:

    refdata = read.table("K:\\MerchantData\\RiskModel\\refund_distribution.csv", header = TRUE)
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 42 elements

    refdata = read.table("K:\\MerchantData\\RiskModel\\refund_distribution.csv")
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 2 did not have 42 elements

R infers from the variable names that you have 42 columns. Do you? See if removing spaces in the column names helps (e.g., week.1 instead of week 1). Also, because yours is a csv file, fields are separated by commas. You can either use the read.csv command instead of read.table (see ?read.table for details), or add the argument sep="," to tell R that fields are separated by commas. You might also need to specify, if you have empty cells, what to do with them (e.g., na.strings="").

You are of course right about the NAs (missing values, empty cells) as well as the possible blanks in the column names. It might nevertheless be a good idea for him to at least submit a few of the lines at the top of the file.

A .csv file as generated by Excel on Windows is not necessarily comma-separated. That depends on the list separator setting under Regional and Language Settings in the Control Panel. On my machine, the list separator for a .csv file is a semicolon. The reason is simple: in Norway, the standard decimal separator is a comma, and you do not want to confuse the system too much. So that particular point depends on the settings for his locale (language, country).

Tom

--
Tom Backer Johnsen, Psychometrics Unit, Faculty of Psychology
University of Bergen, Christies gt. 12, N-5015 Bergen, NORWAY
Tel : +47-5558-9185 Fax : +47-5558-9879
Email : [EMAIL PROTECTED] URL : http://www.galton.uib.no/

-- View this message in context: http://www.nabble.com/Novice-question-about-getting-data-into-R-tp19576065p19577763.html
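The advice in this thread can be sketched on a small made-up file. The column names, values and the per-column summary below are all assumptions for illustration; the real refund_distribution.csv was never posted. The sketch reads a comma-separated file whose quoted column names contain blanks and whose later columns have empty cells, then applies one function to each sample column in turn.

```r
# Made-up miniature of the file described in the thread: quoted column names
# with blanks, and an empty cell where a later week has fewer observations
csv_text <- paste(
  '"Weeks since","Week 18","Week 19","Week 20"',
  '1,0.10,0.12,0.09',
  '2,0.25,0.22,0.20',
  '3,0.05,0.07,',
  sep = "\n"
)

# read.csv presets sep = "," and header = TRUE; blank numeric cells become NA
refdata <- read.csv(text = csv_text, header = TRUE, na.strings = "")
names(refdata)   # blanks in the names are replaced by dots
str(refdata)

# Apply one function to every sample column (all but the first, which holds
# the time variable), dropping the missing values of each column separately.
# The summary here is a placeholder for a real fitting function.
col_summary <- sapply(refdata[-1], function(col) {
  v <- col[!is.na(col)]
  c(n = length(v), mean = mean(v))
})
col_summary
```

For files written with a semicolon list separator and a decimal comma, as Tom describes for a Norwegian locale, read.csv2 uses those two settings by default.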
[R] Use of distribution model to estimate probability of an event
I have a situation where there are two kinds of events: A and B. B does not occur without A occurring first; a percentage of A events lead to an event B some time later and the remaining ones do not. I have n independent samples, with a frequency of B events by week, until B events for a given week's A events no longer happen (after about 10 weeks, the chance of another B event is less than 0.1%). That gives me good enough data to determine which distribution fits the data.

But looking at the data for several weeks of A events, it is clear that although the distributions have a similar shape (e.g. the corresponding B events peak in week two), there are significant differences between weeks of A events regarding the fraction of them that lead to B events (sometimes it is 25% and sometimes it is 45%, with dozens of values in between being observed).

I know how to use R to fit the distributions. The question is this: once I have fit a distribution to the data (i.e. I know the distribution and the parameters that give the best fit), is there a function in R that I can use to obtain the number of B events that will occur in week M, given the number of A events in a prior week N? Knowing the number of A events and a density function, all I need is the probability of a B event in the week of interest. It is a simple forecast, since the week for which we want the answer hasn't come yet. If there is such a function, just tell me the name of the function and the package, and I'll find it and read up on it.

This is for the development of a model of risk (A events being desirable and B events representing a cost to all concerned). It is a simple enough model, but I am having a little trouble finding the last piece of the puzzle that I need.

Thanks

Ted

-- View this message in context: http://www.nabble.com/Use-of-distribution-model-to-estimate-probability-of-an-event-tp19565047p19565047.html
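One way to read the forecasting question above is as a calculation with the fitted cumulative distribution function (the p-functions in R, such as pgamma or pnorm). The expected number of B events falling in a given week is the number of A events, times the fraction of A events that ever lead to a B, times the probability mass the fitted lag distribution puts on that week. The sketch below assumes a gamma lag distribution with made-up parameters; `expected_b` and all of its argument names are hypothetical.

```r
# Expected number of B events in the week that lies `lag_weeks` weeks after
# the A events, i.e. lag_weeks = M - N for A events in week N and target week M.
#   n_a       number of A events in week N
#   frac_b    fraction of A events that eventually lead to a B event
#   cdf       cumulative distribution function of the A-to-B delay, in weeks
#   ...       parameters of the fitted delay distribution, passed to cdf
expected_b <- function(n_a, frac_b, lag_weeks, cdf, ...) {
  n_a * frac_b * (cdf(lag_weeks, ...) - cdf(lag_weeks - 1, ...))
}

# Made-up inputs: 1000 A events, 35% of which lead to a B event, and a
# gamma delay distribution whose parameters would come from the fit
week2 <- expected_b(n_a = 1000, frac_b = 0.35, lag_weeks = 2,
                    cdf = pgamma, shape = 3, rate = 1.2)
week2

# The function is vectorised over lag_weeks, so a whole forecast horizon
# can be produced in one call
horizon <- expected_b(1000, 0.35, 1:10, pgamma, shape = 3, rate = 1.2)
round(horizon, 1)
```

Because the fraction of A events leading to B events varies between 25% and 45% from week to week, running the same calculation over a range of frac_b values may give a more honest picture than a single point forecast.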
Re: [R] library/function that estimates parameters of well known distributions from empirical data?
Thanks Ben,

That was the one I'd remembered but couldn't find. Mark Leeds also told me about DistributionFits(fBasics), which I hadn't seen. There seems to be only a little overlap between the two.

Could I trouble you to expand on AIC (especially the name of the function and package needed to apply it to the output from these two functions)? I just read the help provided for each and neither mentions AIC.

Thanks again, Ben

Ted

Ben Bolker wrote:

Ted Byers r.ted.byers at gmail.com writes:

I found this a few months ago, but for the life of me I can't remember what the function or package was, and I have had no luck finding it this week. I have found, again, the functions for working with distributions like Cauchy, F, normal, etc., and ks.test, but I have not found the functions for estimating the distribution parameters given a vector of values.

Look at the fitdistr function in the MASS package. Consider AIC comparisons for ranking the fits to these non-nested models.

good luck

Ben Bolker

-- View this message in context: http://www.nabble.com/library-function-that-estimates-parameters-of-well-known-distributions-from-empirical-data--tp19323700p19339442.html
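On the AIC question: objects returned by fitdistr have a logLik method, so the generic AIC function in the stats package can be applied to them directly. The sketch below compares a normal and a Cauchy fit on simulated heavy-tailed data; the data are an assumption for illustration, and nothing here covers the fBasics fitting functions.

```r
library(MASS)  # provides fitdistr

set.seed(7)
x <- rt(500, df = 3)  # simulated heavy-tailed data standing in for real data

# Maximum-likelihood fits of the two candidate distributions
fit_norm   <- fitdistr(x, "normal")
fit_cauchy <- fitdistr(x, "cauchy")

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters);
# the smaller value indicates the preferred model
aic <- c(normal = AIC(fit_norm), cauchy = AIC(fit_cauchy))
aic
names(which.min(aic))

# The same number computed by hand for the normal fit (two parameters)
-2 * as.numeric(logLik(fit_norm)) + 2 * 2
```

AIC ranks the candidates relative to one another; it does not say whether the best of them fits well in absolute terms, which is where a goodness-of-fit check still has a role.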
[R] library/function that estimates parameters of well known distributions from empirical data?
I found this a few months ago, but for the life of me I can't remember what the function or package was, and I have had no luck finding it this week. I have found, again, the functions for working with distributions like Cauchy, F, normal, etc., and ks.test, but I have not found the functions for estimating the distribution parameters given a vector of values.

What I need to do is estimate the distribution parameters for each candidate distribution, and then test to see which gives the best fit to the data. I want to examine the question: given this dataset (which may have thousands of records), does the normal or the Cauchy distribution fit the data best, and with what parameters? It will not be known a priori whether or not the most appropriate distribution is non-central, though we do know that often (not always) values of medium size in absolute value are more often positive than negative, and that very large values are more often negative than positive.

Could someone please give me a gentle reminder of the package and function(s) I ought to be examining?

Thanks

Ted

-- View this message in context: http://www.nabble.com/library-function-that-estimates-parameters-of-well-known-distributions-from-empirical-data--tp19323700p19323700.html
Re: [R] license for a university
Erin,

I trust you know what you risk when you assume. ;-)

There IS a license, but it basically lets you copy or distribute it, or, in your case, install it on as many machines as you wish. It is the GNU GENERAL PUBLIC LICENSE. Like most open source software I use, the GNU license is in place primarily to ensure everyone can freely use it.

Cheers

Ted

Erin Hodgess-2 wrote:

Dear R People:

I am trying to install R in a classroom here, but have been told that there must be a license. Is there such a thing with R, please? Since it is free, I assumed that there would be no license.

Thanks for any help,

Sincerely,
Erin

--
Erin Hodgess
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: [EMAIL PROTECTED]

-- View this message in context: http://www.nabble.com/%22license%22-for-a-university-tp1928p19300187.html