[R] problems with rollapply {zoo}

2012-01-23 Thread Ted Byers
Here is a relatively simple script (with comments as to the logic
interspersed):

 

# Some of these libraries are probably not needed here, but leaving them in
# place harms nothing:

library(tseries)

library(xts)

library(quantmod)

library(fGarch)

library(fTrading)

library(ggplot2)

# Set the working directory, where the data file is located, and read the
# raw data

setwd('C:/cygwin/home/Ted/New.Task/NKS-quotes/NKS-quotes')

x = read.table("quotes_M11.dat", header = FALSE, sep="\t", skip=0)

str(x)

# Set up the date column

dt <- sprintf("%s %04d", x$V2, x$V4)

dt <- as.POSIXlt(dt, format="%Y-%m-%d %H%M")

# Prepare a frame that gets converted to an xts object

y <- data.frame(dt, x$V5)

colnames(y) <- c("tickdate", "price")

# Make the xts object, and then the OHLC object (as an aside, the tick data
# includes volume, but I have yet to figure out how to make an OHLC object
# that includes volume)

z <- xts(y[,2], y[,1])

alpha <- to.minutes3(z, OHLC=TRUE)

colnames(alpha) <- c("Open", "High", "Low", "Close")

alpha$rel_t <- seq(1-nrow(alpha), 0)

# Just to check the code for the regression, apply the regression to the
# whole series (unless the series is really short or has a strong slow
# pattern, the regression result is not useful except to show that the code
# works)

polyfit <- lm(Close ~ poly(rel_t,4), alpha)

polyfit2 <- lm(Close ~ rel_t + I(rel_t^2) + I(rel_t^3) + I(rel_t^4),
data=alpha)

# This is the objective, where all the magic happens

rollRegFun <- function(d,i) {

# set up the relative time variable, so that the current record has rt = 0

  d$rt <- seq(1-nrow(d), 0)

# apply the regression to fit a 4th degree polynomial in rt

  polyfit <- lm(Close ~ poly(rt,4), d)

# get the coefficients

  p <- coef(polyfit)

# get the roots of the first derivative of the fitted polynomial

  pr <- polyroot(c(p[2], 2*p[3], 3*p[4], 4*p[5]))

# define a function that evaluates the second derivative as a function of x

  dd <- function(x) { rv = 2*p[3]+6*p[4]*x+12*p[5]*x*x; rv }

# evaluate the second derivative at the ith root, and print the result

  r <- dd(pr[i])

  r

}
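For reference, here is the calculus those two coefficient vectors encode,
writing the fitted polynomial with raw coefficients a_k = p[k+1] (one caveat
I am aware of: poly(rt,4) fits in an orthogonal basis by default, so strictly
these formulas match the polyfit2 parameterization, or poly(rt, 4, raw = TRUE)):

  p(x)   = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4
  p'(x)  = a_1 + 2 a_2 x + 3 a_3 x^2 + 4 a_4 x^3
  p''(x) = 2 a_2 + 6 a_3 x + 12 a_4 x^2

polyroot(c(p[2], 2*p[3], 3*p[4], 4*p[5])) finds the roots of p', and dd()
evaluates p''.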

 

rollRegFun(alpha,1)

rollRegFun(alpha,2)

rollRegFun(alpha,3)

 

The code I show above does not give an error, but if the function is
re-written as:

 

rFun <- function(d) {

  d$rt <- seq(1-nrow(d), 0)

  polyfit <- lm(Close ~ poly(rt,4), d)

  p <- coef(polyfit)

  pr <- polyroot(c(p[2], 2*p[3], 3*p[4], 4*p[5]))

  dd <- function(x) { rv = 2*p[3]+6*p[4]*x+12*p[5]*x*x; rv }

  r <- dd(pr[1])

  r

}

 

and I try to get rollapply to execute it on a moving window, I get errors,
e.g.:

 

> rollapply(as.zoo(alpha), 60, rFun)

Error in from:to : argument of length 0

 

Yet, the following works:

 

rollapply(alpha$Close,60,mean)

 

What do I have to do, either to my function or to my use of rollapply, in
order to get it to work?
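For what it is worth, the one variant I have not yet tried is telling
rollapply not to split the object into columns: if I read ?rollapply
correctly, by.column = TRUE is the default, so rFun would be handed one
column at a time (a plain vector with no Close column and no nrow).  A
sketch of what I mean, untested:

rFunM <- function(w) {   # w would be a 60-row matrix, not a data.frame
  d <- data.frame(Close = w[, "Close"], rt = seq(1 - nrow(w), 0))
  p <- coef(lm(Close ~ poly(rt, 4), d))
  pr <- polyroot(c(p[2], 2*p[3], 3*p[4], 4*p[5]))
  2*p[3] + 6*p[4]*pr[1] + 12*p[5]*pr[1]^2  # second derivative at the first root
}
rollapply(as.zoo(alpha), 60, rFunM, by.column = FALSE)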

 

Thanks

 

Ted




Re: [R] plotOHLC(alpha3): Error in plotOHLC(alpha3) : x is not a open/high/low/close time series

2012-01-11 Thread Ted Byers
Hi Joshua,

Thanks.

I had used irts because I thought I had to.  The tick data I have has some
minutes in which there is no data, and others in which there are hundreds, or
even thousands, of ticks.  If xts supports irregular data, then that is one
less step for me to worry about.

Alas, your suggestion didn't help:

> z <- xts(y[,2], y[,1])
> alpha3 <- to.minutes3(z, OHLC=TRUE)
> plotOHLC(alpha3)
Error in plotOHLC(alpha3) : x is not a open/high/low/close time series
> str(alpha3)
An ‘xts’ object from 2010-06-30 15:47:00 to 2011-10-31 15:14:00 containing:
  Data: num [1:98865, 1:4] 9215 9220 9205 9195 9195 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "z.Open" "z.High" "z.Low" "z.Close"
  Indexed by objects of class: [POSIXct,POSIXt] TZ:
  xts Attributes:
 NULL

Is there anything else I might try?
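Two more guesses I have yet to test myself (so treat this as speculation):
maybe plotOHLC wants a plain multivariate ts rather than an xts, and maybe
it wants the bare column names rather than the "z."-prefixed ones that str()
shows, i.e.

colnames(alpha3) <- c("Open", "High", "Low", "Close")
plotOHLC(as.ts(alpha3))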

Thanks again,

Ted



Re: [R] plotOHLC(alpha3): Error in plotOHLC(alpha3) : x is not a open/high/low/close time series

2012-01-11 Thread Ted Byers
Thanks Joshua,

That did it.

Cheers,

Ted



[R] plotOHLC(alpha3): Error in plotOHLC(alpha3) : x is not a open/high/low/close time series

2012-01-10 Thread Ted Byers
R version 2.12.0, 64 bit on Windows.

 

Here is a short script that illustrates the problem:

 

library(tseries)

library(xts)

setwd('C:\\cygwin\\home\\Ted\\New.Task\\NKs-01-08-12\\NKs\\tests')

x = read.table("quotes_h.2.dat", header = FALSE, sep="\t", skip=0)

str(x)

y <- data.frame(as.POSIXlt(paste(x$V2, substr(x$V4,4,8), sep=" "),
format='%Y-%m-%d %H:%M'), x$V5)

colnames(y) <- c("tickdate", "price")

str(y)

plot(y)

z <- as.irts(y)

str(z)

plot(z)

str(alpha3)

List of 2

$ time : POSIXt[1:98865], format: "2010-06-30 15:47:00" "2010-06-30
15:53:00" "2010-06-30 17:36:00" ...

$ value: num [1:98865, 1:4] 9215 9220 9205 9195 9195 ...

  ..- attr(*, "dimnames")=List of 2

  .. ..$ : NULL

  .. ..$ : chr [1:4] "z.Open" "z.High" "z.Low" "z.Close"

- attr(*, "class")= chr "ts"

- attr(*, "tsp")= num [1:3] 1 2 1

alpha3 <- as.xts(to.minutes3(z, OHLC = TRUE))

plotOHLC(alpha3)

Error in plotOHLC(alpha3) : x is not a open/high/low/close time series

 

The file quotes_h.2.dat contains real-time tick data for futures contracts,
so the above manipulation is my attempt to just get a time series with one
column being a date/time and the other being the tick price.  I believe I
have to use read.table to make a data frame, and then apply the
manipulations shown to combine the date and time fields from that feed,
along with the price.

 

My first attempt at using to.minutes3 (and I am interested in the other
'to.period' functions too) is aimed at getting a regular time series to
which I can apply rollapply, along with a function in which I use various
autoregression methods and forecast for as long as the 95% confidence
interval stays reasonably tight - I want to know how far into the future the
forecast contains useful information.  Then I want to create a plot in which
I do the autoregression and plot the actual and forecast prices (along with
the confidence interval) as a function of time, embed that in a function
that rollapply works with, and so end up with a figure comprised of all
those individual plots (plotting only the comparison of actual and forecast
values).
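To make that concrete, the skeleton I have in mind is roughly the following
(hypothetical and untested; the ar() fit and the one-step horizon are just
placeholders for the real forecasting code):

library(zoo)
cl <- as.zoo(alpha3[, "z.Close"])
fc <- rollapplyr(cl, width = 60, FUN = function(w) {
  fit <- ar(as.numeric(w))                     # fit an AR model to this window
  as.numeric(predict(fit, n.ahead = 1)$pred)   # one-step-ahead forecast
})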

 

It seems everything works adequately until I try the plotOHLC function
itself, which gives me the error in the subject line.

 

I would ask for two things: 

 

1) what the fix is to get rid of that error plotOHLC gives me

2) some tips on the 'walk-forward' method I am looking at using.

 

Thanks

 

Ted




[R] How do I get rid of list elements where the value is NULL before applying rbind?

2010-07-22 Thread Ted Byers
Here is the function that makes the data.frames in the list:

funweek <- function(df)
  if (length(df$elapsed_time) > 5) {
    res = fitdist(df$elapsed_time, "exp")
    year = df$sale_year[1]
    sample = df$sale_week[1]
    mid = df$m_id[1]
    estimate = res$estimate
    sd = res$sd
    samplesize = res$n
    loglik = res$loglik
    aic = res$aic
    bic = res$bic
    chisq = res$chisq
    chisqpvalue = res$chisqpvalue
    chisqdf = res$chisqdf
    if (!is.null(estimate) && !is.null(sd) && !is.null(loglik) &&
        !is.null(aic) && !is.null(bic) &&
        !is.null(chisq) && !is.null(chisqpvalue) && !is.null(chisqdf)) {
      rv =
data.frame(mid,year,sample,samplesize,estimate,sd,loglik,aic,bic,chisq,chisqpvalue,chisqdf)
      rv
    }
  }

I use the following, with different data, successfully:

z <-
lapply(split(moreinfo,list(moreinfo$m_id,moreinfo$sale_year,moreinfo$sale_week),drop
= TRUE), funweek)
qqq <- z[, c('mid', 'year', 'sample', 'samplesize', 'estimate', 'sd',
'loglik', 'aic','bic', 'chisq', 'chisqpvalue', 'chisqdf')]
ndf2 <- do.call(rbind, qqq)


However, I am now getting the following error:

> qqq <- z[, c('mid', 'year', 'sample', 'samplesize', 'estimate', 'sd',
+ 'loglik', 'aic','bic', 'chisq', 'chisqpvalue', 'chisqdf')]
Error in z[, c("mid", "year", "sample", "samplesize", "estimate", "sd",  :
  incorrect number of dimensions



My suspicion is that this is due to the fact that sometimes one or more of
the elements in my conditional block is NULL, so nothing is returned, and
that this puts a NULL element into z.  Here is a selection of a couple of
elements so you can see what is in 'z'.

$`353.2010.0`
  mid year sample samplesize   estimate sd   loglik  aic
 rate 353 2010  0 17 0.06463837 0.01567335 -63.5621 129.1242
   bicchisq chisqpvalue chisqdf
 rate 129.9574 14.90239 0.001901994   3

 $`355.2010.0`
 NULL

 $`376.2010.0`
  mid year sample samplesize   estimate sdloglik  aic
 rate 376 2010  0  6 0.07228863 0.02950606 -21.76253 45.52506
   bicchisq  chisqpvalue chisqdf
 rate 45.31682 16.46848 4.946565e-05   1


You see the value for rowname `355.2010.0` is NULL, and it is my guess that
this leads to the error I show above.  But I can't confirm that yet, because
I don't yet know how to get rid of rows that have a row name but only NULL
as the value.

I haven't seen this dealt with in the references I have read so far.

I think I may be able to deal with it by creating dummy values for the
fields the data frame requires, and then use SQL to remove them, but I'd
rather not have to resort to that if I can avoid it.

I can't believe there isn't something in the base package for R that would
easily handle this, but not knowing the name of the function to look at, I
haven't found it yet.
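For concreteness, the kind of one-liner I was hoping exists is something
like this (untested - I don't know whether it is the idiomatic way):

z2 <- Filter(Negate(is.null), z)   # or: z[!sapply(z, is.null)]
ndf2 <- do.call(rbind, z2)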

Any information would be appreciated.

Thanks

Ted



[R] One problem with RMySQL and a query that returns an empty recordset

2010-07-20 Thread Ted Byers
My last query related to this referred to a problem with not being able to
store data.  A suggestion was made to try to convert the data returned by
fitdist into a data.frame before using rbind.  That failed, but provided the
key to solving the problem (which was to create a data.frame using the
variables fitdist produces in the object it returns).

I now have almost everything working as intended.  However, there is one
problem.

Here is the error:

'data.frame':   0 obs. of  0 variables
Error in `[.data.frame`(moreinfo, , 1) : undefined columns selected
Calls: [ -> [.data.frame
Execution halted

The curious thing is that this happens when my script is called from within
Perl.  Within Rgui, the script continues through to the end, but the loop
that is involved terminates at the line where this error occurs.  The line
that results in this error is:

  moreinfo <- dbGetQuery(con, x)

This statement occurs in a loop that ought to iterate over a few hundred
values for m_id (see the SQL below).  Because of the above error, I never
see about two thirds of the results that ought to be produced.

At the time that the error occurs, x contains the following SQL query:

SELECT m_id,sale_date,YEAR(sale_date) AS sale_year,MONTH(sale_date) AS
sale_month,return_type,0.0001 + DATEDIFF(return_date,sale_date) AS
elapsed_time FROM `merchants2`.`risk_input` WHERE m_id = 361 AND return_type
= 1 AND DATEDIFF(return_date,sale_date) IS NOT NULL;

If I execute this SQL, I find the resultset is empty.  So assigning the
value returned by dbGetQuery to moreinfo works ONLY if the resultset is not
empty; it fails with a fatal error if the resultset is empty.  So the
question is: how can I revise that statement so that the assignment happens
only if the resultset is NOT empty?
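To be explicit, the kind of guard I am picturing inside the loop is
something like this (untested):

moreinfo <- dbGetQuery(con, x)
if (nrow(moreinfo) > 0) {
  # ... fit and store the results for this m_id ...
} else {
  next   # empty resultset: skip to the next m_id
}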

Thanks

Ted



[R] I need help making a data.frame comprised of selected columns of an original data frame.

2010-07-16 Thread Ted Byers
I must have missed something simple, but still, I don't know what.

I obtained my basic data as follows:

x <- sprintf("SELECT m_id,sale_date,YEAR(sale_date) AS
sale_year,WEEK(sale_date) AS sale_week,return_type,0.0001 +
DATEDIFF(return_date,sale_date) AS elapsed_time FROM
`merchants2`.`risk_input` WHERE DATEDIFF(return_date,sale_date) IS NOT
NULL")
moreinfo <- dbGetQuery(con, x)

I then made the data frame I want to use as follows:

fun_m_id <- function(df)
  if (length(df$elapsed_time) > 5) {
    rv = fitdist(df$elapsed_time, "exp")
    rv$mid = df$m_id[1]
    rv
  }
aaa <- lapply(split(moreinfo, list(moreinfo$m_id), drop = TRUE), fun_m_id)
m_id_default_res <- do.call(rbind, aaa)

At this point, each row in m_id_default_res corresponds to one data.frame
produced by fitdist.  When I print it, I get the output I expected.
However, I need to store only some of it into my DB.

And then, because fitdist produces a data frame that includes a lot of info
I don't need to store in the DB, I tried making a new data.frame containing
only the info I need as follows:
ndf = data.frame()
for (i in 1:length(m_id_default_res[,1])) {
  ndf$mid[i] = m_id_default_res$mid[i]
  ndf$estimate[i] = m_id_default_res$estimate[i]
  ndf$sd[i] = m_id_default_res$sd[i]
  ndf$n[i] = m_id_default_res[i]
  ndf$loglik[i] = m_id_default_res$loglik[i]
  ndf$aic[i] = m_id_default_res$aic[i]
  ndf$bic[i] = m_id_default_res$bic[i]
  ndf$chisq[i] = m_id_default_res$chisq[i]
  ndf$chisqpvalue[i] = m_id_default_res$chisqpvalue[i]
  ndf$chisqdf[i] = m_id_default_res$chisqdf[i]
}
ndf

And I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "n", value = list(0.114752782316094)) :
  replacement has 1 rows, data has 0

I need to either get rid of the columns in m_id_default_res that I don't
need, or copy only those columns I need to a new data.frame.  How do I do
this?  Obviously, doing an element-wise copy, at least as I tried to do it,
doesn't work.

Thanks,

Ted



Re: [R] I need help making a data.frame comprised of selected columns of an original data frame.

2010-07-16 Thread Ted Byers
Hi Steve,

Thanks

Here is a tiny subset of the data:
> dput(head(moreinfo, 40))
structure(list(m_id = c(171, 206, 206, 206, 206, 206, 206, 218,
224, 224, 227, 229, 229, 229, 229, 229, 229, 229, 229, 233, 233,
238, 238, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251,
251, 251, 251, 251, 251, 251), sale_date = c("2008-04-25 07:41:09",
"2008-05-09 20:58:12", "2008-09-06 19:51:52", "2008-05-01 21:26:40",
"2008-08-06 23:53:17", "2008-05-29 18:44:50", "2008-05-16 16:10:52",
"2008-12-30 17:59:54", "2008-11-06 18:15:40", "2008-09-05 17:43:51",
"2008-10-31 21:55:52", "2008-04-30 21:30:36", "2008-11-11 00:43:54",
"2008-07-24 22:26:29", "2008-10-07 17:57:22", "2008-04-23 20:39:41",
"2008-09-08 22:42:12", "2008-11-13 00:09:59", "2008-04-15 22:57:31",
"2008-07-05 08:52:58", "2008-10-04 13:17:02", "2008-03-20 23:02:12",
"2008-08-08 16:48:42", "2008-06-04 04:31:20", "2008-09-27 07:02:14",
"2008-09-08 07:16:39", "2008-09-25 07:09:11", "2008-09-23 07:02:39",
"2008-08-09 07:31:46", "2008-09-28 07:02:13", "2008-07-05 07:26:46",
"2008-05-11 04:01:55", "2008-06-26 07:46:17", "2008-07-09 07:36:16",
"2008-07-21 18:36:44", "2008-10-11 07:01:36", "2008-07-21 19:03:42",
"2008-05-07 04:21:23", "2008-10-14 07:07:02", "2008-05-12 04:26:21"
), sale_year = c(2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L), sale_week = c(16L,
18L, 35L, 17L, 31L, 21L, 19L, 52L, 44L, 35L, 43L, 17L, 45L, 29L,
40L, 16L, 36L, 45L, 15L, 26L, 39L, 11L, 31L, 22L, 38L, 36L, 38L,
38L, 31L, 39L, 26L, 19L, 25L, 27L, 29L, 40L, 29L, 18L, 41L, 19L
), return_type = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1), elapsed_time = c(1e-04, 1e-04, 3.0001, 4.0001,
21.0001, 5.0001, 24.0001, 1.0001, 8.0001, 1e-04, 1e-04, 8.0001,
14.0001, 55.0001, 35.0001, 1e-04, 1e-04, 4.0001, 1e-04, 2.0001,
5.0001, 1e-04, 52.0001, 4.0001, 28.0001, 49.0001, 34.0001, 72.0001,
5.0001, 53.0001, 128.0001, 8.0001, 2.0001, 55.0001, 1.0001, 12.0001,
46.0001, 30.0001, 12.0001, 12.0001)), .Names = c("m_id", "sale_date",
"sale_year", "sale_week", "return_type", "elapsed_time"), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35",
"36", "37", "38", "39", "40"), class = "data.frame")


The full dataset has almost 200,000 observations!  That is why I hadn't
posted the raw data.  And m_id_default_res is even bigger because it
includes all the original data along with the computed stats.


Yes, the following line you pointed out has a typo:

ndf$n[i] = m_id_default_res[i]

It should have been

ndf$n[i] = m_id_default_res$n[i]

Correcting that makes the error go away, but at the end of the loop, ndf is
said to have 0 columns and 0 rows.  That I don't understand.

But your statement (as corrected for the right source name) below does what
I'd intended.
ndf <- m_id_default_res[, c('mid', 'estimate', 'sd', 'loglik', 'aic','bic',
'chisq', 'chisqpvalue', 'chisqdf')]

Thanks

Ted


On Fri, Jul 16, 2010 at 12:04 PM, Steve Lianoglou 
mailinglist.honey...@gmail.com wrote:

 Hi,

 First: it's kind of hard to play along w/o some reproducible data. To
 that end, you can paste into an email the output of:

 dput(moreinfo)

 If there are lots of rows in `moreinfo`, just give us the first ~10-20

 dput(head(moreinfo, 20))

 Anyway:

 snip
  At this point, each row in m_id_default_res corresponds to one data.frame
  produced by fitdist.  When I print it, I get the output I expected.
  However, I need to store only some of it into my DB.
 
  And then, because fitdist produces a data frame that includes a lot of
 info
  I don't need to store in the DB, I tried making a new data.frame
 containing
  only the info I need as follows:
  ndf = data.frame()
  for (i in 1:length(m_id_default_res[,1])) {
   ndf$mid[i] = m_id_default_res$mid[i]
   ndf$estimate[i] = m_id_default_res$estimate[i]
   ndf$sd[i] = m_id_default_res$sd[i]
   ndf$n[i] = m_id_default_res[i]
   ndf$loglik[i] = m_id_default_res$loglik[i]
   ndf$aic[i] = m_id_default_res$aic[i]
   ndf$bic[i] = m_id_default_res$bic[i]
   ndf$chisq[i] = m_id_default_res$chisq[i]
   ndf$chisqpvalue[i] = m_id_default_res$chisqpvalue[i]
   ndf$chisqdf[i] = m_id_default_res$chisqdf[i]
  }

 Forget the for loop. How about:

 ndf <- m_id_default[, c('mid', 'estimate', 'sd', 'loglik', 'aic',
 'bic', 'chisq', 'chisqpvalue', 'chisqdf')]

 Having just written that, I see something strange in your for loop.
 Specifically this line:

   ndf$n[i] = m_id_default_res[i]

 m_id_default_res is a data.frame, right? Why don't you try to see what
 `m_id_default_res[1]` returns.

 I'm not sure that that's what your error message is coming from, but I
 foresee this to be a problem anyway, if I follow your build up code
 correctly.

 Hope that helps,

 --
 Steve Lianoglou
 Graduate Student: Computational Systems 

[R] Elementary question about computing confidence intervals.

2010-07-16 Thread Ted Byers
I would have thought this to be relatively elementary, but I can't find it
mentioned in any of my stats texts.

Please consider the following:

library(fitdistrplus)

  fp = fitdist(y, "exp");
  rate = fp$estimate;
  sd = fp$sd
  fOneWeek = exp(-rate*7); # fraction still remaining after one week - y is
measured in days
  fr = exp(-rate*dt);  # fraction remaining - dt = elapsed time from time of
sample to present
  fh = 1 - fr;  # fraction that occurred from time of sample to present

# assume n = total number that have happened from time of sample to present
  T = n / fh  # T is the total number at y = 0
  NR = fr * T
  NNW = NR * (1 - fOneWeek)  # expected number over the coming week

(If you wanted to run this, just populate y with random numbers from an
exponential distribution.)

What I show here simply extracts an estimate and standard deviation from the
data.frame returned by fitdist, and tries to compute a number of integrals.
What I need is the number of events that can be expected next week, next
month, and from now to the end of time.  Unless I have gone senile in my old
age, I have the integrals correct; please correct me if I missed something.
But what I need help with (to refresh my memory - I used to know this way
back in the stone age) is computing the confidence intervals for each of
these integrals.

So that I don't bother anyone with similar elementary questions: what web
resource exists that defines confidence intervals for such integrals for
arbitrary distributions?  Or does such a resource exist?
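To make the question concrete, the half-remembered tool I am trying to pin
down is, I believe, the delta method; for the within-a-week probability
F(\lambda) = 1 - e^{-7\lambda} it would give (if I have it right):

  se(F(\hat{\lambda})) \approx |F'(\hat{\lambda})| \cdot se(\hat{\lambda})
                       = 7 e^{-7\hat{\lambda}} \cdot se(\hat{\lambda})

  95% CI \approx F(\hat{\lambda}) \pm 1.96 \cdot se(F(\hat{\lambda}))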

Thanks

Ted



[R] Troubles with DBI's dbWriteTable in RMySQL

2010-07-16 Thread Ted Byers
I am feeling rather dumb right now.

I created what I thought was a data.frame as follows:

aaa <- lapply(split(moreinfo, list(moreinfo$m_id), drop = TRUE), fun_m_id)
m_id_default_res <- do.call(rbind, aaa)
print("==")
m_id_default_res
print("==")
ndf <- m_id_default_res[, c('mid', 'estimate', 'sd', 'loglik', 'aic','bic',
'chisq', 'chisqpvalue', 'chisqdf')]
ndf

The data in ndf is perfect, exactly what I expected when I print the
contents as shown in the last statement above.

On the assumption that that is a data frame, I tried

dbWriteTable(con, "test1", ndf);

But I received the following error:

Error in function (classes, fdef, mtable)  :
  unable to find an inherited method for function "dbWriteTable", for
signature "MySQLConnection", "character", "matrix"

Then, on the assumption it is trivial to convert a matrix into a data.frame,
i tried:

dbWriteTable(con, "test2", as.data.frame(ndf));

But this produced the following error:
Error in write.table(x, file, nrow(x), p, rnames, sep, eol, na, dec,
as.integer(quote),  :
  unimplemented type 'list' in 'EncodeElement'

The silly, and frustrating, thing is that I used dbWriteTable before, and it
worked adequately.  But that was with a simple data frame (built within a
for loop, element by element - res$var[[i]] = expression), not the result of
do.call(rbind, ...).  The principal limitation I saw in my previous use of
dbWriteTable is that all fields are given the type 'TEXT', and that it
insists on creating a new table.  What I'd prefer is a kind of bulk insert
that just adds extra records to an existing table.

So, given my past experience with dbWriteTable, it is a question of what
do.call(rbind, ...) did to produce an ndf that dbWriteTable doesn't like.

So, then, what is the best way to either get dbWriteTable working (ideally
in a way that works around the limitations I mention above) or to do a bulk
insert into my MySQL table (yes, I already have a table in the relevant
schema with all the right data types for each field, and I load RMySQL at
the start of my program)?  In a worst case, I can live with an insertion one
record at a time.
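For reference, the shape of what I am after, if it exists, is something like
this (untested; I am guessing that flattening whatever list columns
do.call(rbind, ...) produced is the coercion needed, and the table name is
just a placeholder):

ndf2 <- as.data.frame(lapply(as.data.frame(ndf), unlist))  # flatten list columns
dbWriteTable(con, "test2", ndf2, append = TRUE, row.names = FALSE)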

Thanks

Ted

PS: If it helps, here are the contents of ndf, as shown by entering 'ndf' at
the R prompt:
> ndf
    mid   estimate           sd    loglik      aic      bic     chisq  chisqpvalue chisqdf
206 206 0.1147528  0.04336918   -22.15483 46.30965 46.25556 4.433502  0.03524013   1
229 229 0.0736     0.01999671   -56.41179 114.8236 115.5962 195307.1  0            2
251 251 0.074421   0.002171616  -4224.072 8450.144 8455.212 593302.2  0            18
252 252 0.03710208 0.0004556731 -28426.82 56855.65 56862.45 3543373   0            38
253 253 0.01397349 0.0005900857 -2925.179 5852.358 5856.677 283.9848  5.232282e-51 16
254 254 0.09043846 0.01528502   -119.108  240.216  241.7713 23.52441  3.139385e-05 3
255 255 0.05078883 0.0006021373 -28294.38 56590.76 56597.63 1988844   0            35
260 260 0.03392846 0.005499136  -166.5730 335.1461 336.7837 10.83060  0.05484413   5
268 268 0.05357114 0.01785082   -35.3407  72.6814  72.87863 82995.79  0            2
286 286 0.09321947 0.01987217   -74.20157 150.4031 151.4942 1.698603  0.6372445    3
290 290 0.03841793 0.006584153  -144.8139 291.6277 293.1541 135.8937  2.902434e-29 3
292 292 0.06289269 0.01988338   -37.66325 77.32651 77.6291  143099.8  0            2
297 297 0.01674874 0.004047625  -86.52035 175.0407 175.8739 47.27713  3.034432e-10 3
302 302 0.02878066 0.003876092  -250.1428 502.2857 504.293  9.22447   0.2369393    7
306 306 0.07904849 0.0004164051 -127449.0 254899.9 254908.4 111574416 0            40
307 307 0.01655872 0.001320903  -795.7314 1593.463 1596.513 57.38622  1.127804e-08 10
308 308 0.02631102 0.000884155  -4095.149 8192.298 8197.081 142.8876  3.904898e-20 21
309 309 0.09891599 0.0084501    -453.9474 909.8947 912.8147 357135.5  0            8
310 310 0.09332047 0.004580396  -1399.262 2800.524 2804.552 217126    0            13
311 311 0.06378327 0.0005049166 -59848.62 119699.2 119706.9 59481893  0            34
313 313 0.06203001 0.0006486936 -34546.67 69095.34 69102.46 18207698  0            32
316 316 0.173      0.07026985   -25.04100 52.08199 52.38458 18002.22  0            2
317 317 0.04405086 0.0005949207 -22578.44 45158.88 45165.49 8923236   0            33
320 320 0.05747093 0.006634162  -289.2357 580.4714 582.7889 8.641322  0.2794433    7
321 321 0.06365155 0.003692525  -1115.037 2232.073 2235.767 19.10553  0.08601337   12
322 322 0.05737672 0.01532991   -54.01363 110.0273 110.6663 9.597753  0.008238998  2
323 323 0.03116934 0.001909146  -1188.573 2379.146 2382.73  109.7663  6.656046e-18 12
324 324 0.03027327 0.0004146385 -23922.15 47846.3  47852.88 47330365  0            32
325 325 0.06047783 0.00922026   -163.6356 329.2711 331.0323 1695781   0            3
326 326 0.05627898 0.0008642285 -16432.57 32867.13 32873.48 3405089   0            29
327 327 0.07052627

[R] How do I combine lists of data.frames into a single data frame?

2010-07-15 Thread Ted Byers
The data.frame is constructed by one of the following functions:

funweek <- function(df)
  if (length(df$elapsed_time) > 5) {
    rv = fitdist(df$elapsed_time, "exp")
    rv$year = df$sale_year[1]
    rv$sample = df$sale_week[1]
    rv$granularity = "week"
    rv
  }
funmonth <- function(df)
  if (length(df$elapsed_time) > 5) {
    rv = fitdist(df$elapsed_time, "exp")
    rv$year = df$sale_year[1]
    rv$sample = df$sale_month[1]
    rv$granularity = "month"
    rv
  }

It is basically the data.frame created by fitdist extended to include the
variables used to distinguish one sample from another.

I have the following statement that gets me a set of IDs from my db:

ids <- dbGetQuery(con, "SELECT DISTINCT m_id FROM risk_input")

And then I have a loop that allows me to analyze one dataset after another:

for (i in 1:length(ids[,1])) {
  print(i)
  print(ids[i,1])

Then, after a set of statements that give me information about the dataset
(such as its size), within a conditional block that ensures I apply the
analysis only on sufficiently large samples, I have the following:

z <- lapply(split(moreinfo,list(moreinfo$sale_year,moreinfo$sale_week),drop
= TRUE), funweek)

or z <-
lapply(split(moreinfo,list(moreinfo$sale_year,moreinfo$sale_month),drop =
TRUE), funmonth)

followed by:

str(z)

Of course, I close the loop and disconnect from my db.

NB: I don't see any way to get rid of the loop by adding ID as a factor to
split because I have to query the DB for several key bits of data in order
to determine whether or not there is sufficient data to work on.

I have everything working except the final step of storing the results back
into the DB.  Storing data in the DB is easy enough, but I am at a loss as
to how to combine the lists placed in z in most of the iterations through
the ID loop into a single data.frame.

Now, I did take a look at rbind and cbind, but it isn't clear to me if
either is appropriate.  All the data frames have the same structure, but the
lists are of variable length, and I am not certain how either might be used
inside the IDs loop.

So, what is the best way to combine all lists assigned to z into a single
data.frame?

Thanks

Ted



Re: [R] How do I combine lists of data.frames into a single data frame?

2010-07-15 Thread Ted Byers
Thanks Marc

The next part of the question, though, involves the fact that there is a new
'z' list made in almost every iteration through the ID loop.

I guess there are two parts to the question.  First, how would I make a list
containing all the data frames created by a call to rbind?  I assume, then,
that I could call rbind again to make that new list into a single
data.frame.  Second, is it possible to just append one list of objects to
another list of objects, and would doing that and calling rbind on that
master list be more efficient than calling rbind on each z list and then
calling rbind after the loop on the list of such data.frames?
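To make the second part concrete, here is the shape I have in mind
(untested):

all_z <- list()
for (i in 1:length(ids[,1])) {
  # ... build z for this ID, as before ...
  all_z <- c(all_z, z)            # append this iteration's list of data frames
}
big_df <- do.call(rbind, all_z)   # a single rbind at the end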

Thanks again,

Ted

On Thu, Jul 15, 2010 at 3:27 PM, Marc Schwartz marc_schwa...@me.com wrote:

 Ted,

 If each of the data frames in the list 'z' have the same column structure,
 you can use:

  do.call(rbind, z)

 The result of which will be a single data frame containing all of the rows
 from each of the data frames in the list.

 HTH,

 Marc Schwartz




-- 
R.E.(Ted) Byers, Ph.D.,Ed.D.
t...@merchantservicecorp.com
CTO
Merchant Services Corp.
350 Harry Walker Parkway North, Suite 8
Newmarket, Ontario
L3Y 8L3




[R] exercise in frustration: applying a function to subsamples

2010-07-12 Thread Ted Byers
From the documentation I have found, it seems that one of the functions from
package plyr, or a combination of functions like split and lapply, would
allow me to have a really short R script to analyze all my data (I have
reduced it to a couple hundred thousand records with about half a dozen
fields).

I get the same result from ddply and split/lapply:

> ddply(moreinfo, c("m_id", "sale_year", "sale_week"),
+   function(df) data.frame(res = fitdist(df$elapsed_time, "exp"), est =
res$estimate, sd = res$sd))
Error in fitdist(df$elapsed_time, "exp") :
  data must be a numeric vector of length greater than 1


and


> lapply(split(moreinfo, list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week)),
+   function(df) fitdist(df$elapsed_time, "exp"))
Error in fitdist(df$elapsed_time, "exp") :
  data must be a numeric vector of length greater than 1


Now, in retrospect, unless I misunderstood the properties of a data.frame, I
suppose a data.frame might not have been entirely appropriate, as the m_id
samples start and end on very different dates, but I would have thought a
list data structure should have been able to handle that.  It would seem
that split is making groups that have the same start and end dates (or that,
if, for example, I have sale data for precisely the last year, split would
insist on both 2009 and 2010 having weeks from 0 through 52, instead of just
the weeks in each year that actually have data: 26 through 52 for last year
and 1 through 25 for this year).  I don't see how else the data passed to
fitdist could have a sample size of 0.

I'd appreciate understanding how to resolve this.  However, it isn't a show
stopper, as it now seems trivial to just break it out into a loop (followed
by a lapply/split combo using only sale year and sale month).

While I am asking, is there a better way to split such temporally ordered
data into weekly samples that respect the year in which the sample is taken
as well as the week in which it is taken?

Thanks

Ted



Re: [R] exercise in frustration: applying a function to subsamples

2010-07-12 Thread Ted Byers
OK,  here is a stripped down variant of my code.  I can run it here
unchanged (apart from the credentials for connecting to my DB).

Sys.setenv(MYSQL_HOME='C:/Program Files/MySQL/MySQL Server 5.0')
library(TSMySQL)
library(plyr)
library(fitdistrplus)
con <- dbConnect(MySQL(), user="rejbyers", password="jesakos",
dbname="merchants2")
x <- sprintf("SELECT m_id,sale_date,YEAR(sale_date) AS
sale_year,WEEK(sale_date) AS sale_week,return_type,0.0001 +
DATEDIFF(return_date,sale_date) AS elapsed_time FROM `risk_input` WHERE
DATEDIFF(return_date,sale_date) IS NOT NULL")
x
moreinfo <- dbGetQuery(con, x)
str(moreinfo)
#moreinfo
#print(moreinfo)
dbDisconnect(con)
f1 <- fitdist(moreinfo$elapsed_time, "exp");
summary(f1)
lapply(split(moreinfo,list(moreinfo$m_id,moreinfo$sale_year,moreinfo$sale_week),drop
= TRUE),
  function(df) fitdist(df$elapsed_time, "exp"))


I guess that for others to run this script, it is just necessary to create
some sample data, consisting of two or more m_id values (I have several
hundred) and temporally ordered data for each.  I am not familiar enough
with R to know how to do that using R.  Usually, if I need dummy data, I
make it with my favourite RNG using either C++ or Perl.  I am still trying
to get used to R.

Each record in my data has one random variate and a MySQL TIMESTAMP
(yyyy-mm-dd hh:mm:ss), anywhere from hundreds to thousands each week, for
anywhere from a few months to several years.  My SQL actually produces the
random variate by taking the difference between the sale date and return
date, and is structured as it is because I know how to group by year and
week from a timestamp field using SQL, but didn't know how to accomplish the
same thing in R.

The statement 'x' by itself always shows me the correct SQL statement to get
the data (I can execute it unchanged in the mysql command-line client).
'str(moreinfo)' always gives me the data structure I expect.  E.g.:

> str(moreinfo)
'data.frame':   177837 obs. of  6 variables:
 $ m_id: num  171 206 206 206 206 206 206 218 224 224 ...
 $ sale_date   : chr  "2008-04-25 07:41:09" "2008-05-09 20:58:12"
"2008-09-06 19:51:52" "2008-05-01 21:26:40" ...
 $ sale_year   : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ sale_week   : int  16 18 35 17 31 21 19 52 44 35 ...
 $ return_type : num  1 1 1 1 1 1 1 1 1 1 ...
 $ elapsed_time: num  0.0001 0.0001 3.0001 4.0001 21.0001 ...

'summary(f1)' shows me the results I expect from the aggregate data.  E.g.:

> summary(f1)
FITTING OF THE DISTRIBUTION ' exp ' BY MAXIMUM LIKELIHOOD
PARAMETERS
  estimate   Std. Error
rate 0.0652917 0.0001547907
Loglikelihood:  -663134.7   AIC:  1326271   BIC:  1326281
--
GOODNESS-OF-FIT STATISTICS

_ Chi-squared_
Chi-squared statistic:  400277239
Degree of freedom of the Chi-squared distribution:  56
Chi-squared p-value:  0
!!! the p-value may be wrong with some theoretical counts < 5 !!!

!!! For continuous distributions, Kolmogorov-Smirnov and
  Anderson-Darling statistics should be prefered !!!

_ Kolmogorov-Smirnov_
Kolmogorov-Smirnov statistic:  0.1660987
Kolmogorov-Smirnov test:  rejected
!!! The result of this test may be too conservative as it
 assumes that the distribution parameters are known !!!

_ Anderson-Darling_
Anderson-Darling statistic:  Inf
Anderson-Darling test:  rejected


And at the end, I get the error I mentioned.  NB: In this variant, I added
drop = TRUE as Jim suggested.


lapply(split(all_samples,list(all_samples$m_id,all_samples$sale_year,all_samples$sale_week),drop
= TRUE),
+   function(df) fitdist(df$elapsed_time, "exp"))
Error in fitdist(df$elapsed_time, "exp") :
  data must be a numeric vector of length greater than 1

If, then, drop = TRUE results in all empty combinations of m_id, year and
week being excluded, then (noticing the requirement is actually that the
sample size be greater than 1), I can only conclude that at least one of the
samples has only 1 record.

But that is too small.  Is there a way to allow the above code to apply
fitdist only if the sample size of a given subsample is greater than, say,
100?
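Something along these lines is what I am imagining (untested):

groups <- split(moreinfo, list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week), drop = TRUE)
big    <- Filter(function(d) nrow(d) > 100, groups)   # keep only the larger subsamples
fits   <- lapply(big, function(df) fitdist(df$elapsed_time, "exp"))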

Even better, is there a way to make the split more dynamic, so that it
groups a given m_id's data by month if the average weekly subsample size is
less than 100, or by day if the average weekly subsample is greater than
1000?

Thanks

Ted


On Mon, Jul 12, 2010 at 3:20 PM, Erik Iverson er...@ccbr.umn.edu wrote:

 Your code is not reproducible.  Can you come up with a small example
 showing the crux of your data structures/problem, that we can all run in our
 R sessions?  You're likely get much higher quality responses this way.

 Ted Byers wrote:

 From the documentation I have found, it seems that one of the functions
 from

 package plyr, or a combination of functions like split and lapply would
 allow me to have a really short R script to analyze all my data (I have

Re: [R] exercise in frustration: applying a function to subsamples

2010-07-12 Thread Ted Byers
Thanks Jim,

I acted on your suggestion and found the result unchanged.  :-(  Then I
noticed that fitdist doesn't like a sample size of 1 either.

If, then, drop = TRUE results in all empty combinations of m_id, year and
week being excluded, then (noticing the requirement is actually that the
sample size be greater than 1), I can only conclude that at least one of the
samples has only 1 record. I hadn't realized that some of the subsamples
were that small.  In my reply to Erik, I wrote:

But that is too small.  Is there a way to allow the above code to apply
 fitdist only if the sample size of a given subsample is greater than, say,
 100?  Even better, is there a way to make the split more dynamic, so that it
 groups a given m_id's data by month if the average weekly subsample size is
 less than 100, or by day if the average weekly subsample is greater than
 1000?


Thanks

Ted

On Mon, Jul 12, 2010 at 4:02 PM, jim holtman jholt...@gmail.com wrote:

 try 'drop=TRUE' on the split function call.  This will prevent the
 NULL set from being sent to the function.



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem that you are trying to solve?




-- 
R.E.(Ted) Byers, Ph.D.,Ed.D.
t...@merchantservicecorp.com
CTO
Merchant Services Corp.
350 Harry Walker Parkway North, Suite 8
Newmarket, Ontario
L3Y 8L3



[R] I need guidance on better data management in preparation for time series analysis

2010-06-30 Thread Ted Byers
OK, I have managed to use some of the basic processes of getting data from
my DB, passing it as a whole to something like fitdistr, etc.  I know I can
implement most of what I need using a brute-force algorithm based on a
series of nested loops.  I also know I can handle some of this logic in a
brute-force method using a blend of Perl and R, with considerable file IO.
But some of what I need needs a smarter/faster way.

To understand what I am after, consider the following.  I have transaction
data comprised of sales and refunds, each of which has a timestamp.  The
refund data has a timestamp representing when the refund was issued and an
original transaction ID representing the sale it refunds.  I have massaged
this data in my schema so that there is a table that has a record for each
refund, and this record includes, among other things, the timestamps for
both the original sale and the refund.  I can construct a SQL query to get
these along with the elapsed time (in days, as a real number) between the
sale and refund.  For some merchants, I have such data going back years.  I
know, from the amount of data I have examined, that the rate at which sales
result in refunds changes through time, though I have not run tests to
determine whether or not the changes I see are significant.  In most cases,
I can break the data for a merchant into weekly subsamples.

Obviously, I can construct loops that iterate over merchant ID, and
year/week (or day) covering the entire period for which I have data for a
given merchant.  What I am asking is, Is there a smarter way?

I can't load all the data as there are many GB of data, but the data for
individual merchants varies from a few hundred kB to a few dozen MB.  Thus,
I expect an outer loop iterating over merchant ID will be inevitable.

But is there a smarter way to apply fitdistr (or a similar function) to
samples representing sales in each week of each year (or each day of the
year when there is sufficient data), and then test to see if the parameter
of the exponential distribution that best fits the data varies significantly
through time (there are both theoretical and empirical reasons to expect an
exponential distribution, but the specific distribution doesn't really
matter for the purpose of this question)?  That is one question I need to
deal with.  Is there a simple way to specify a function, a dataset and a
rule for determining all the subsamples, and then tell R to apply the
function to each subsample and say whether or not the estimated parameters
for the subsamples are significantly different?  Or do I have to resort to
the simple brute-force approach of using a set of nested loops to get what I
need?
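The shape of what I am hoping for is roughly this (hypothetical table and
column names, untested):

library(MASS)
samples <- split(refunds, list(refunds$year, refunds$week), drop = TRUE)
fits    <- lapply(samples, function(d) fitdistr(d$elapsed_time, "exponential"))
rates   <- sapply(fits, function(f) f$estimate["rate"])  # one rate per weekly subsample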

The other question I have at present is more a statistical question.
Integrating an exponential pdf over a given time period is simple enough,
but I need to learn how confidence intervals for that integral are computed
when you have the estimate and std of the parameter for the exponential
distribution from something like fitdistr.  This gets to how to get
confidence intervals when dealing with integrals of functions of uncertain
numbers.  Not only is there a confidence interval for the parameter of the
exponential distribution, but to estimate how many refunds to expect for the
next week, one not only needs the confidence intervals of the integral of
the pdf over the next week for a given sample, but one needs to integrate
this over all the samples that could produce a refund in the coming week.

I'd appreciate any information anyone can provide, even if it consists of an
URL that points to a resource that deals with the specific questions I have.
I am afraid all the resources I have found searching so far have been at a
more introductory level, of simply making a connection to a DB and then
submitting a SQL statement to it.  Something in between that level and the
maze of documentation for the plethora of relevant packages is needed here
(there is such an embarrassment of riches, I find myself getting confused as
to how to proceed).

Thanks

Ted



Re: [R] Can RMySQL be used for a paramterized query?

2010-06-10 Thread Ted Byers
Thanks,

Actually, thanks to the info Henrique sent, I have made decent progress.

In actuality, I could have just submitted a SELECT * on the second table,
which would give me everything, just as Henrique's suggestion, and yours,
would.  The problem is that that table is HUGE (I don't want to load ALL
that data at once, especially when I'd be analyzing it in chunks defined by
ID and date), and at the same time, the analyses would be done not only one
ID at a time, but on records pertaining to a given day (e.g., imagine a
dataset containing all sales and refund data; assuming the rates at which
sales end in refunds vary through time, something I know from previous
analyses of similar data, I would need to analyze all refunds for sales that
happened on a given day).

While I was aware I could use RMySQL to get my time series data (I will be
assessing a VAR on a 3D time series once my current task is done), I looked
at TSMySQL because, being relatively inexperienced with R, I need to be able
to do a variety of autoregressive analyses.  Someone has suggested I also
look at state-space modelling, but being a mathematical ecologist by
training, I am struggling with that along with Kalman filtering.  But that
is another post ...

Thanks again

Ted

On Thu, Jun 10, 2010 at 5:51 PM, Paul Gilbert 
pgilb...@bank-banque-canada.ca wrote:

 Ted

 I'm not sure I fully understand the question, but you may want to consider
 creating a temporary table with a join, which you can do with a query from
 your R session, and then query that table to bring the data into R. Roughly,
 the logic is to leave the data in the db if you are not doing any fancy
 calculations. You might also find order by is useful. (This is what I use
 in TSdbi to make sure data comes back in the right order as a time series.)
 It may even be possible to get everything you want back in one step using
 this and group by, rather than looping, but depending on the analysis you
 want to do in R, that may not be the most convenient way.

 BTW, I think you realize you do not have to use the TSMySQL commands to
 access the TSMySQL database. They are usually convenient, but you can query
 the tables directly with RMySQL functions.

 Paul

-----Original Message-----
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
 On Behalf Of Henrique Dallazuanna
 Sent: June 10, 2010 8:47 AM
 To: Ted Byers
 Cc: R-help Forum
 Subject: Re: [R] Can RMySQL be used for a paramterized query?
 
 I think you can do this:
 
  ids <- dbGetQuery(conn, "SELECT id FROM my_table")
  other_table <- dbGetQuery(conn, sprintf("SELECT * FROM my_other_table WHERE t1_id in
  (%s)", paste(ids, collapse = ",")))
 
 On Wed, Jun 9, 2010 at 11:24 PM, Ted Byers r.ted.by...@gmail.com
 wrote:
 
  I have not found anything about this except the following from the DBI
  documentation :
 
  Bind variables: the interface is heavily biased towards queries, as
  opposed
   to general
   purpose database development. In particular we made no attempt to
   define bind variables; this is a mechanism by which the contents
   of R/S objects are implicitly moved to the database during SQL
   execution. For instance, the following embedded SQL statement
   /* SQL */
   SELECT * from emp_table where emp_id = :sampleEmployee would take
   the vector sampleEmployee and iterate over each of its
  elements
   to get the result. Perhaps the DBI could at some point in the future
   implement this feature.
  
 
  I can connect, and execute a SQL query such as "SELECT id FROM
  my_table", and display a frame with all the IDs from my_table.  But I
  need also to do something like "SELECT * FROM my_other_table WHERE
  t1_id = x" where 'x' is one of the IDs returned by the first select
  statement.  Actually, I have to do this in two contexts, one where the
  data are not ordered by time and one where it is (and thus where I'd
  have to use TSMySQL to execute something like "SELECT
 record_datetime,value FROM my_ts_table WHERE t2_id = x").
 
  I'd like to embed this in a loop where I iterate over the IDs returned
  by the first select, get the appropriate data from the second for each
  ID, analyze that data and store results in another table in the DB,
  and then proceed to the next ID in the list.  I suppose an alternative
  would be to get all the data at once, but the resulting resultset
  would be huge, and I don't (yet) know how to take a subset of the data
  in a frame based on a given value in one ot the fields and analyze
  that.  Can you point me to an example of how this is done, or do I
  have to use a mix of perl (to get the
  data) and R (to do the analysis)?
 
  Any insights on how to proceed would be appreciated.  Thanks.
 
  Ted
 
 [[alternative HTML version deleted]]
 
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html

[R] Can RMySQL be used for a parameterized query?

2010-06-09 Thread Ted Byers
I have not found anything about this except the following from the DBI
documentation :

Bind variables: the interface is heavily biased towards queries, as opposed
 to general
 purpose database development. In particular we made no attempt to define
 “bind
 variables”; this is a mechanism by which the contents of R/S objects are
 implicitly
 moved to the database during SQL execution. For instance, the following
 embedded SQL statement
 /* SQL */
 SELECT * from emp_table where emp_id = :sampleEmployee
 would take the vector sampleEmployee and iterate over each of its elements
 to get the result. Perhaps the DBI could at some point in the future
 implement
 this feature.


I can connect, and execute a SQL query such as "SELECT id FROM my_table",
and display a frame with all the IDs from my_table.  But I need also to do
something like "SELECT * FROM my_other_table WHERE t1_id = x" where 'x' is
one of the IDs returned by the first select statement.  Actually, I have to
do this in two contexts, one where the data are not ordered by time and one
where it is (and thus where I'd have to use TSMySQL to execute something
like "SELECT record_datetime,value FROM my_ts_table WHERE t2_id = x").

I'd like to embed this in a loop where I iterate over the IDs returned by
the first select, get the appropriate data from the second for each ID,
analyze that data and store results in another table in the DB, and then
proceed to the next ID in the list.  I suppose an alternative would be to
get all the data at once, but the resulting resultset would be huge, and I
don't (yet) know how to take a subset of the data in a frame based on a
given value in one of the fields and analyze that.  Can you point me to an
example of how this is done, or do I have to use a mix of perl (to get the
data) and R (to do the analysis)?
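
If the whole result set were pulled into one frame, the subsetting step
asked about above is a one-liner; a sketch, assuming a frame alldata with
the t1_id column from the queries above:

perid <- subset(alldata, t1_id == x)   # or: alldata[alldata$t1_id == x, ]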

Any insights on how to proceed would be appreciated.  Thanks.

Ted

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] TS model

2010-06-08 Thread Ted Byers
I am looking at a new project involving time series analysis.  I know I can
complete the tasks involving VARMA using either dse or mAr (and I think
there are a couple others that might serve).

However, there is one task that I am not sure of the best way to proceed.

A simple example illustrates what I am after.  If you think of a simple
ballistic problem, with a vector describing current position in 3
dimensions, the components of that vector are simple functions of initial
position, initial velocity (constants, for our purposes) and time.  It is
trivial calculus to compute these values at arbitrary time using only
initial conditions and time.  Of course, for such a simple problem, we know
the equations of motion that we can use for this purpose.

I want to use time series values to estimate a suitable vector valued
function of time in a case where we know neither the equations of change nor
the initial conditions (but where we have daily values going back many
years).  Actually, I don't really care much about the details of the
function nearly as much as the first and second derivatives of the function
with respect to time; and these derivatives have to be inferred from the
model of the measurements as 'simple' functions of time.  And as I do not
want to assume the system is autonomous, I want to be able to repeat the
analysis on a moving window wherein always the current day is designated as
having s = 0 (i.e. the time variable used in the estimated model slides
along the one representing real time).  I figure that if that window is short
enough, a quadratic or cubic function of time will suffice.  Finally, if the
combination of first and second derivatives indicates that the first
derivative will take a value of 0 at some point in the future, I want to
estimate the number of days until that happens.  (yes, I know I will need
some sort of orthogonalization of the time variable in order to reduce
problems of multicollinearity, but that I'd expect in any multivariate
nonlinear regression).
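
A sketch of that fit for a single window, under the assumptions above: a
numeric vector y of daily values, a cubic in relative time with the current
day at t = 0, and raw powers rather than poly() only so the coefficients can
be read off directly (orthogonalization would change the bookkeeping, not
the idea):

n   <- length(y)
t   <- seq(1 - n, 0)                       # current day has t = 0
fit <- lm(y ~ t + I(t^2) + I(t^3))
p   <- coef(fit)                           # p[1] + p[2]*t + p[3]*t^2 + p[4]*t^3
# roots of the first derivative: p[2] + 2*p[3]*t + 3*p[4]*t^2 = 0
r   <- polyroot(c(p[2], 2 * p[3], 3 * p[4]))
r   <- Re(r[abs(Im(r)) < 1e-8])            # keep numerically real roots
r[r > 0]                                   # days ahead at which the slope hits zero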

I don't know if this could be recast as a VARMA problem, or if so, how and
how I'd get the answers to the questions of importance to me.  I would
welcome  being enlightened on this, if there is an answer.

The question is, Is there a package that already provides support for this
'out of the box', as it were, and if so which one, or do I have to construct
code supporting it de novo?

Thanks

Ted

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] What does this warning mean: DLL attempted to change FPU control word from 8001f to 9001f

2010-05-14 Thread Ted Byers
I started a brand new session in R 2.10.1 (on Windows).
If it matters, I am running the community edition of MySQL 5.0.67, and it is
all running fine.

I am just beginning to examine the process of getting time series data from
one table in MySQL, computing moving averages and computing a selection of
estimates based on relations among moving averages of different variates,
and storing all the results in another table in MySQL.

The very first thing I did in this session was execute the following two
commands:

Sys.setenv(MYSQL_HOME='c:/MySQL')
library(RMySQL)

The output I got was:

Loading required package: DBI
Warning message:
In inDL(x, as.logical(local), as.logical(now), ...) :
  DLL attempted to change FPU control word from 8001f to 9001f

Now, I write programs in relatively high level languages (C++, perl, Java,
and now R), and NEVER even consider twiddling with FPU control words or
playing with registers on the processor.  I have never gotten this close to
the hardware since I messed with video memory in the old days when I wrote
computer based teaching materials on DOS and had to get acceptable
performance out of the hardware available way back then.  Consequently, I
have no idea what this warning means or what I ought to do about it.  I
assume the DLL it is referring to is libmySQL.dll,
which RMySQL needs.  But I have no idea either why it would do what R says
it is doing or why it matters to me, or what I ought to do about it.

I'd appreciate any info you can provide.

Thanks

Ted

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Upgrade process for libraries: can I use installed.packages on an old installation followed by install.packages in a new one

2010-04-29 Thread Ted Byers
I tend to have a lot of packages installed, in part because of a wide
diversity of interests and a disposition of examining different ways to
accomplish a given task.

I am looking for a better way to upgrade all my packages when I upgrade the
version of R that I am running.

On looking at support for installing and updating packages, I found these
two: installed.packages() and  install.packages() and it occurred to me that
in principle I ought to be able to use the one in the original installation
to get a list of packages I'm working with and put its output into a
plain text file that I can read in the new installation and pass to the
other to ensure the new installation has a fresh installation of all the
packages I want to work with.

The question comes WRT the fact the output from installed.packages() does
not coincide with the expected input for install.packages().  What would you
recommend I do to select from the output from the former so the file I write
that output to will have the information the latter wants for input?  For
example, will it work properly if I just write the package names
installed.packages() returns to the file and ignore all the rest?  It is not
clear to me how I'd have it ignore those packages that are part of the core
of R (or even if I need to worry about that - I did see some packages listed
in the output from installed.packages() that are identified as being part of
R 2.10.1, when I looked at using this procedure to set up R 2.11.0).

NB: I am not suggesting the output from the one should coincide with the
expected input for the other.  Rather, I am asking advice on writing simple
R scripts that I can run in the one to get a file that would be suitable
input for the other that would together make a fresh installation of a new
version automatically make a fresh installation of all the previously
installed packages.
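
A sketch of that recipe, assuming the packages that ship with R can be
recognized by a non-NA Priority in the output of installed.packages():

# in the OLD installation: save the names of the contributed packages
ip   <- installed.packages()
mine <- ip[is.na(ip[, "Priority"]), "Package"]
writeLines(mine, "pkglist.txt")

# in the NEW installation: read the names back and install them fresh
install.packages(readLines("pkglist.txt"))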

Thanks

Ted

PS: I am using Windows XP, if that matters.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Upgrade process for libraries: can I use installed.packages on an old installation followed by install.packages in a new one

2010-04-29 Thread Ted Byers
When doing a fresh install of a new version of R, using update.packages()
requires copying some of the contents of the library subdirectory to the new
installation.  While possible and viable, it can be problematic in being
tedious (more an irritation regarding how Windows handles copying
directories from one location to another when there are already things in
the target directory with the same names, than anything else), and there
exists the possibility that there are some old packages that are obsolete
and won't work properly in the new version.

I don't suppose update.packages() will remove obsolete packages in the
library directory if it finds them, does it?

I have a preference for trying to do a fresh install of a given product's
optional packages (so if a given package has a problem in the new version,
it just doesn't install - rather than cluttering its directory tree with
useless stuff); something that is trivially easy if looking at only a
handful of optional packages but very tedious when there are so many.  I
know from experience that repeatedly having a large, complex piece of
software, whether a major application (like R or MS Word, &c.) or an OS like
Windows, update/over-write key part of itself will eventually lead to hard
to diagnose problems.  It is often good to have more than one way to
accomplish a given task, and there are usually many options to choose from
when designing/implementing software.

Actually, with the benefit of 20/20 hindsight, if I had been asked to write
an update.packages() function, I would have had it look in the registry on
Windows, or in the directory tree, for evidence of an older version of R
(perhaps a version that is used only during a fresh install of R), and have
it process the list of detected packages and install/upgrade any packages
that will work with the new version of R, and perhaps, if a given obsolete
package has been superseded by something else, make sure that 'something
else' is installed instead, just so the directory tree for the new install
is not cluttered with old, potentially broken, stuff.

Thanks

Ted

On Thu, Apr 29, 2010 at 4:59 PM, Erik Iverson er...@ccbr.umn.edu wrote:



 Ted Byers wrote:

 I tend to have a lot of packages installed, in part because of a wide
 diversity of interests and a disposition of examining different ways to
 accomplish a given task.

 I am looking for a better way to upgrade all my packages when I upgrade
 the
 version of R that I am running.

 On looking at support for installing and updating packages, I found these
 two: installed.packages() and  install.packages() and it occurred to me
 that
 in principle I ought to be able to use the one in the original
 installation
 to get a list of packages I'm working with and and put its output into a
 plain text file that I can read in the new installation and pass to the
 other to ensure the new installation has a fresh installation of all the
 packages I want to work with.


 I must be missing the obvious, but what's wrong with update.packages() ?




-- 
R.E.(Ted) Byers, Ph.D.,Ed.D.
t...@merchantservicecorp.com
CTO
Merchant Services Corp.
350 Harry Walker Parkway North, Suite 8
Newmarket, Ontario
L3Y 8L3

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Error loading RMySQL

2010-04-28 Thread Ted Byers
I have R 2.10.1 and 2.9.1 installed, and both have RMySQL packages
installed.

A script I'd developed using an older version (2.8.?, I think) used RMySQL
too and an older version of MySQL (5.0.?), and worked fine at that time
(about a year and a half ago +/- a month or two).

But now, when I run it again, on new data, the script works fine until it is
supposed to store its results in my DB.  (In actuality, I have a perl script
that does an initial preparation of the data, and then invokes the R script -
all of which works fine.)

The very last five lines of my script are:

print(resultsdataframe);
library(RMySQL);
con <-
dbConnect(MySQL(), user="rejbyers", password="jesakos", dbname="merchants2");
dbWriteTable(con, "results", resultsdataframe);
dbDisconnect(con);


The print statement works fine, and shows me the results I expected.  But
library(RMySQL) fails, which makes all the rest of the lines fail.  The
command and error message is:

 library(RMySQL);
 Error in fun(...) :
   A MySQL Registry key was found but the folder C:\Program
 Files\MySQL\MySQL Administrator 1.1\/. doesn't contain a bin or lib/opt
 folder. That's where we need to find libmySQL.dll.
 Error : .onLoad failed in 'loadNamespace' for 'RMySQL'
 Error: package/namespace load failed for 'RMySQL'


It IS true that the folder C:\Program Files\MySQL\MySQL Administrator 1.1\/.
doesn't contain a bin or lib/opt folder (There is no trailing 'V' in that
path name!).  However, libmySQL.dll actually is there in 'C:\Program
Files\MySQL\MySQL Administrator 1.1'.   It is also true that MySQL 5.0.67
(community edition) is installed and all the related tools (administrator,
browser, &c.) work just fine.  (NB: I never touch the registry unless
absolutely necessary, so I don't know what R is looking at in there or if
some misbehaved install program left bogus data there for R to find.)

The question is, why is R looking in the wrong place for this DLL and what
is the best way to solve this problem?

I know a quick and dirty solution, to work around this, is to create that
path and put a copy of the DLL there, but that does not strike me as
adequate.  I would expect that to possibly generate problems the next time I
upgrade MySQL.

So, then, what would you recommend?
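
For what it is worth, another session in this archive sidesteps the same
registry lookup by pointing RMySQL at the MySQL tree before the package is
loaded; a sketch, assuming the server files live under c:/MySQL:

Sys.setenv(MYSQL_HOME='c:/MySQL')   # must run before library(RMySQL)
library(RMySQL)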

Thanks

Ted

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] The time series analysis functions/packages don't seem to like my data

2009-07-03 Thread Ted Byers
I have hundreds of megabytes of price data time series, and perl
scripts that extract it to tab delimited files (I have C++ programs
that must analyse this data too, so I get Perl to extract it rather
than have multiple connections to the DB).

I can read the data into an R object without any problems.

thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
= FALSE, na.strings="")
thedata

The above statements give me precisely what I expect.  The last few
lines of output are:
8190 2009-06-16 49.30
8191 2009-06-17 48.40
8192 2009-06-18 47.72
8193 2009-06-19 48.83
8194 2009-06-22 46.85
8195 2009-06-23 47.11
8196 2009-06-24 46.97
8197 2009-06-25 47.43

I have loaded Rmetrics and PerformanceAnalytics, among other packages.
 I tried as.timeseries, but R2.9.1 tells me there is no such function.
I tried as.ts(thedata), but that only replaces the date field by the
row label in 'thedata'.

If I apply the performance analytics drawdowns function to either
thedata or thedate$V2, I get errors:
 table.Drawdowns(thedata,top = 10)
Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


thedata$V2 by itself does give me the price data from the file.

I am a relative novice in using R for timeseries, so I wouldn't be
surprised it I missed something that would be obvious to someone more
practiced in using R, but I don't see what that could be from the
documentation of the functions I am looking at using.  I have no
shortage of data, and I don't want to write C++ code, or perl code, to
do all the kinds of calculations provided in, Rmetrics and
performanceanalytics, but getting my data into the functions these
packages provide is killing me!

What did I miss?
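
One sketch of the missing step, assuming (from the printed output) that V1
holds the dates and V2 the prices: index the prices by date with zoo instead
of coercing the whole frame.

library(zoo)
z <- zoo(thedata$V2, as.Date(thedata$V1))   # date-indexed price series
head(z)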

Thanks

Ted

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] The time series analysis functions/packages don't seem to like my data

2009-07-03 Thread Ted Byers
Hi Mark

Thanks for replying.

Here is a short snippet that reproduces the problem:

library(PerformanceAnalytics)
thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
= FALSE, na.strings="")
thedata
x = as.timeseries(thedata)
x
table.Drawdowns(thedata,top = 10)
table.Drawdowns(thedata$V2, top = 10)

The object 'thedata' has exactly what I expected. the line 'thedata'
prints the correct contents of the file with each row prepended by a
line number.  The last few lines are:

8191 2009-06-17 48.40
8192 2009-06-18 47.72
8193 2009-06-19 48.83
8194 2009-06-22 46.85
8195 2009-06-23 47.11
8196 2009-06-24 46.97
8197 2009-06-25 47.43

The number of lines (8197), dates (and their format) and prices are correct.

The last four lines produce the following output:
 x = as.timeseries(thedata)
Error: could not find function "as.timeseries"
 x
Error: object 'x' not found
 table.Drawdowns(thedata,top = 10)
Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


Are the functions in your example in Rmetrics or PerformanceAnalytics?
(like I said, I am just beginning this exploration, and I started with
table.Drawdowns because it produces information that I need first)
And given that my data is in tab delimited files, and can be read
using read.csv, how do I feed my data into your four statements?

My guess is I am missing something in coercing my data in (the data
frame?) thedata into a timeseries array of the sort the time series
analysis functions need: and one of the things I find a bit confusing
is that some of the documentation for this mentions S3 classes and
some mentions S4 classes (I don't know if that means I have to make
multiple copies of my data to get the output I need).  I could coerce
thedata$V2 into a numeric vector, but I'd rather not separate the
prices from their dates unless that is necessary (how would one
produce monthly, annual or annualized rates of return if one did
that?).

Thanks

Ted

On Fri, Jul 3, 2009 at 6:39 PM, Mark Knecht markkne...@gmail.com wrote:
 On Fri, Jul 3, 2009 at 2:48 PM, Ted Byers r.ted.by...@gmail.com wrote:
 I have hundreds of megabytes of price data time series, and perl
 scripts that extract it to tab delimited files (I have C++ programs
 that must analyse this data too, so I get Perl to extract it rather
 than have multiple connections to the DB).

 I can read the data into an R object without any problems.

 thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
 = FALSE, na.strings="")
 thedata

 The above statements give me precisely what I expect.  The last few
 lines of output are:
 8190 2009-06-16 49.30
 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 I have loaded Rmetrics and PerformanceAnalytics, among other packages.
  I tried as.timeseries, but R2.9.1 tells me there is no such function.
 I tried as.ts(thedata), but that only replaces the date field by the
 row label in 'thedata'.

 If I apply the performance analytics drawdowns function to either
 thedata or thedate$V2, I get errors:
 table.Drawdowns(thedata,top = 10)
 Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
 Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


 thedata$V2 by itself does give me the price data from the file.

 I am a relative novice in using R for timeseries, so I wouldn't be
 surprised it I missed something that would be obvious to someone more
 practiced in using R, but I don't see what that could be from the
 documentation of the functions I am looking at using.  I have no
 shortage of data, and I don't want to write C++ code, or perl code, to
 do all the kinds of calculations provided in, Rmetrics and
 performanceanalytics, but getting my data into the functions these
 packages provide is killing me!

 What did I miss?

 Thanks

 Ted

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 Could you supply some portion of the results when you run the example
 on your data? The example goes like:

 data(edhec)
 R=edhec[,"Funds.of.Funds"]
 findDrawdowns(R)
 sortDrawdowns(findDrawdowns(R))

 How are you using the function with your data?

 - Mark


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] The time series analysis functions/packages don't seem to like my data

2009-07-03 Thread Ted Byers
Hi David,

Thanks for replying.

On Fri, Jul 3, 2009 at 8:08 PM, David Winsemius dwinsem...@comcast.net wrote:

 On Jul 3, 2009, at 7:34 PM, Ted Byers wrote:

 Hi Mark

 Thanks for replying.

 Here is a short snippet that reproduces the problem:

 library(PerformanceAnalytics)
  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata
 x = as.timeseries(thedata)
 x
 table.Drawdowns(thedata,top = 10)
 table.Drawdowns(thedata$V2, top = 10)

 The object 'thedata' has exactly what I expected. the line 'thedata'
 prints the correct contents of the file with each row prepended by a
 line number.  The last few lines are:

 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 The number of lines (8197), dates (and their format) and prices are
 correct.

 The last four lines produce the following output:

 x = as.timeseries(thedata)

 Error: could not find function "as.timeseries"

 That is not telling you that there is no such function but rather that you
 have not loaded the package that contains it. To find out what package (
 which you have installed on your machine) contains a function, you type one
 of these equivalents:

 ??as.timeseries

 help.search("as.timeseries")

I did this, which is why I tried as.timeseries in the first place.

 If the needed package is not installed on your machine then you need to use
 one of the R search sites. I use:
 http://search.r-project.org/nmz.html

 In my installation there is a function named as.timeSeries in the package
 timeSeries. Not sure if that is the function you want. (Spelling must be
 exact in R.) If it is, then try:

 library(timeSeries)

timeSeries was already installed.  And using library(timeSeries)
succeeds but does not help.

 x

 Error: object 'x' not found

 Not surprising, since the effort to create x failed.

Right,.  I wasn't surprised by this.

 table.Drawdowns(thedata,top = 10)

 Error in 1 + na.omit(x) : non-numeric argument to binary operator

 Not sure whether this is due to earlier errors or something that is wrong
 with your data. Most probably the latter, and since you have not reduced it
 to a reproducible example, no one can tell from a distance. If you were
 expecting the operation of giving thedata to as.timeseries() to have a
 lasting effect on thedata, you need to re-read the introductory material
 on R that is readily available. That's not how the language works.

The only thing missing from my example is the data file itself.  I
have no problems providing that too, but I didn't think that was
permitted (and it is too large to embed within a message).

No, I did not expect thedata to be modified by as.timeseries.  I
just thought I'd try to see if table.Drawdowns would accept a data
frame.  And my call to  table.Drawdowns(thedata$V2, top = 10) was to
see if it would even accept a numeric vector (which is what I'd
expected the price data to be represented as).

Thanks

Ted

 table.Drawdowns(thedata$V2, top = 10)

 Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


 Are the functions in your example in Rmetrics or PerformanceAnalytics?
 (like I said, I am just beginning this exploration, and I started with
 table.Drawdowns because it produces information that I need first)
 And given that my data is in tab delimited files, and can be read
 using read.csv, how do I feed my data into your four statements?

 My guess is I am missing something in coercing my data in (the data
 frame?) thedata into a timeseries array of the sort the time series
 analysis functions need: and one of the things I find a bit confusing
 is that some of the documentation for this mentions S3 classes and
 some mentions S4 classes (I don't know if that means I have to make
 multiple copies of my data to get the output I need).  I could coerce
 thedata$V2 into a numeric vector, but I'd rather not separate the
 prices from their dates unless that is necessary (how would one
 produce monthly, annual or annualized rates of return if one did
 that?).

 Thanks

 Ted

 On Fri, Jul 3, 2009 at 6:39 PM, Mark Knecht markkne...@gmail.com wrote:

 On Fri, Jul 3, 2009 at 2:48 PM, Ted Byers r.ted.by...@gmail.com wrote:

 I have hundreds of megabytes of price data time series, and perl
 scripts that extract it to tab delimited files (I have C++ programs
 that must analyse this data too, so I get Perl to extract it rather
 than have multiple connections to the DB).

 I can read the data into an R object without any problems.

  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata

 The above statements give me precisely what I expect.  The last few
 lines of output are:
 8190 2009-06-16 49.30
 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 I have loaded

Re: [R] The time series analysis functions/packages don't seem to like my data

2009-07-03 Thread Ted Byers
Hi Gabor,  Thanks.

On Fri, Jul 3, 2009 at 8:25 PM, Gabor Grothendieck ggrothendi...@gmail.com wrote:
 # 1. You can directly read your data into a zoo series like this:

 Lines <- "8190 2009-06-16 49.30
 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43"


OK.  Now I have to read up on zoo too.  I was going to get to that, as
I saw it mentioned in a couple of reviews related to analyzing financial
data.

I apologize if this is a naive question, but if I am reading my data
successfully using:

thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
= FALSE, na.strings="")

can my "thedata" be used in the same way as your "Lines"?  Or would
that be a different function call?

What is your "Lines" anyway: a vector containing a series of strings?
a matrix of strings? one long string distributed over a series of
lines?

 library(zoo)
 z <- read.zoo(textConnection(Lines), index = 2)

 # and from that you can readily convert it to
 # other time series formats if need be.

 # 2. Read ?table.Drawdowns.  It asks for __returns__, not raw
 # data as input.

OOPS, so I'll need an extra step.  It is trivial to convert my data to
daily deltas.  I was more concerned at the moment with just getting my
time series data into a form the time series functions require.
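
For what it is worth, Lines above is just a single multi-line character
string standing in for the file, which is why it is wrapped in
textConnection().  Reading the real tab-delimited file directly, and taking
log-returns for table.Drawdowns, might look like the sketch below (column
positions - date first, price second - are assumed from the printed output):

library(zoo)
z   <- read.zoo("K:\\Work\\SignalTest\\BP.csv", sep = "\t",
                format = "%Y-%m-%d")
ret <- diff(log(z))   # continuously compounded daily returns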

Thank you.  This is quite useful.

Cheers

Ted

 library(PerformanceAnalytics)
 table.Drawdowns(diff(log(z$V3)))

 That gives me an error and, looking into it, it seems
 likely that table.Drawdowns fails when there is only one
 drawdown.

 library(help = PerformanceAnalytics)

 will give you the author's email address to whom you
 can report the problem.

 On Fri, Jul 3, 2009 at 7:34 PM, Ted Byers r.ted.by...@gmail.com wrote:
 Hi Mark

 Thanks for replying.

 Here is a short snippet that reproduces the problem:

 library(PerformanceAnalytics)
  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata
 x = as.timeseries(thedata)
 x
 table.Drawdowns(thedata,top = 10)
 table.Drawdowns(thedata$V2, top = 10)

 The object 'thedata' has exactly what I expected. the line 'thedata'
 prints the correct contents of the file with each row prepended by a
 line number.  The last few lines are:

 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 The number of lines (8197), dates (and their format) and prices are correct.

 The last four lines produce the following output:
 x = as.timeseries(thedata)
 Error: could not find function "as.timeseries"
 x
 Error: object 'x' not found
 table.Drawdowns(thedata,top = 10)
 Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
 Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


 Are the functions in your example in Rmetrics or PerformanceAnalytics?
 (like I said, I am just beginning this exploration, and I started with
 table.Drawdowns because it produces information that I need first)
 And given that my data is in tab delimited files, and can be read
 using read.csv, how do I feed my data into your four statements?

 My guess is I am missing something in coercing my data in (the data
 frame?) thedata into a timeseries array of the sort the time series
 analysis functions need: and one of the things I find a bit confusing
 is that some of the documentation for this mentions S3 classes and
 some mentions S4 classes (I don't know if that means I have to make
 multiple copies of my data to get the output I need).  I could coerce
 thedata$V2 into a numeric vector, but I'd rather not separate the
 prices from their dates unless that is necessary (how would one
 produce monthly, annual or annualized rates of return if one did
 that?).

 Thanks

 Ted

 On Fri, Jul 3, 2009 at 6:39 PM, Mark Knecht markkne...@gmail.com wrote:
 On Fri, Jul 3, 2009 at 2:48 PM, Ted Byers r.ted.by...@gmail.com wrote:
 I have hundreds of megabytes of price data time series, and perl
 scripts that extract it to tab delimited files (I have C++ programs
 that must analyse this data too, so I get Perl to extract it rather
 than have multiple connections to the DB).

 I can read the data into an R object without any problems.

  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata

 The above statements give me precisely what I expect.  The last few
 lines of output are:
 8190 2009-06-16 49.30
 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 I have loaded Rmetrics and PerformanceAnalytics, among other packages.
  I tried as.timeseries, but R2.9.1 tells me there is no such function.
 I tried as.ts(thedata), but that only replaces the date field by the
 row label in 'thedata'.

 If I apply the 

Re: [R] The time series analysis functions/packages don't seem to like my data

2009-07-03 Thread Ted Byers
Hi Mark,

Thanks.

Your example works fine.  But I see you're struggling with the same
issue that I am.  I also see the format of the dates in the dataset
you use in your example is the same format that my dates are in.

I just read it, so I haven't had a chance to investigate, but you
might take a look at Gabor's response to me to see if read.zoo can
help move data from a file (or whatever is returned by read.csv) into
a zoo series.

Cheers,

Ted

On Fri, Jul 3, 2009 at 8:40 PM, Mark Knecht markkne...@gmail.com wrote:
 On Fri, Jul 3, 2009 at 4:34 PM, Ted Byers r.ted.by...@gmail.com wrote:
 Hi Mark

 Thanks for replying.

 Here is a short snippet that reproduces the problem:

 library(PerformanceAnalytics)
  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata
 x = as.timeseries(thedata)
 x
 table.Drawdowns(thedata,top = 10)
 table.Drawdowns(thedata$V2, top = 10)

 The object 'thedata' has exactly what I expected. the line 'thedata'
 prints the correct contents of the file with each row prepended by a
 line number.  The last few lines are:

 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 The number of lines (8197), dates (and their format) and prices are correct.

 The last four lines produce the following output:
 x = as.timeseries(thedata)
 Error: could not find function "as.timeseries"
 x
 Error: object 'x' not found
 table.Drawdowns(thedata,top = 10)
 Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
 Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


 Are the functions in your example in Rmetrics or PerformanceAnalytics?
 (like I said, I am just beginning this exploration, and I started with
 table.Drawdowns because it produces information that I need first)
 And given that my data is in tab delimited files, and can be read
 using read.csv, how do I feed my data into your four statements?

 My guess is I am missing something in coercing my data in (the data
 frame?) thedata into a timeseries array of the sort the time series
 analysis functions need: and one of the things I find a bit confusing
 is that some of the documentation for this mentions S3 classes and
 some mentions S4 classes (I don't know if that means I have to make
 multiple copies of my data to get the output I need).  I could coerce
 thedata$V2 into a numeric vector, but I'd rather not separate the
 prices from their dates unless that is necessary (how would one
 produce monthly, annual or annualized rates of return if one did
 that?).

 Thanks

 Ted

 On Fri, Jul 3, 2009 at 6:39 PM, Mark Knecht markkne...@gmail.com wrote:
 On Fri, Jul 3, 2009 at 2:48 PM, Ted Byers r.ted.by...@gmail.com wrote:
 I have hundreds of megabytes of price data time series, and perl
 scripts that extract it to tab delimited files (I have C++ programs
 that must analyse this data too, so I get Perl to extract it rather
 than have multiple connections to the DB).

 I can read the data into an R object without any problems.

  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata

 The above statements give me precisely what I expect.  The last few
 lines of output are:
 8190 2009-06-16 49.30
 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 I have loaded Rmetrics and PerformanceAnalytics, among other packages.
  I tried as.timeseries, but R2.9.1 tells me there is no such function.
 I tried as.ts(thedata), but that only replaces the date field by the
 row label in 'thedata'.

 If I apply the performance analytics drawdowns function to either
 thedata or thedate$V2, I get errors:
 table.Drawdowns(thedata,top = 10)
 Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
 Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


 thedata$V2 by itself does give me the price data from the file.

 I am a relative novice in using R for timeseries, so I wouldn't be
 surprised it I missed something that would be obvious to someone more
 practiced in using R, but I don't see what that could be from the
 documentation of the functions I am looking at using.  I have no
 shortage of data, and I don't want to write C++ code, or perl code, to
 do all the kinds of calculations provided in, Rmetrics and
 performanceanalytics, but getting my data into the functions these
 packages provide is killing me!

 What did I miss?

 Thanks

 Ted

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 Could 

Re: [R] The time series analysis functions/packages don't seem to like my data

2009-07-03 Thread Ted Byers
Sorry, I should have read the read.zoo documentation before replying
to thank Gabor for his response.

Here is how it starts:

read.zoo {zoo} R Documentation

Reading and Writing zoo Series
Description
read.zoo and write.zoo are convenience functions for reading and
writing zoo series from/to text files. They are convenience
interfaces to read.table and write.table, respectively.

Usage
read.zoo(file, format = "", tz = "", FUN = NULL,
  regular = FALSE, index.column = 1, aggregate = FALSE, ...)

Clearly this should solve both our problems.

Cheers,

Ted

On Fri, Jul 3, 2009 at 8:40 PM, Mark Knecht markkne...@gmail.com wrote:
 On Fri, Jul 3, 2009 at 4:34 PM, Ted Byers r.ted.by...@gmail.com wrote:
 Hi Mark

 Thanks for replying.

 Here is a short snippet that reproduces the problem:

 library(PerformanceAnalytics)
  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata
 x = as.timeseries(thedata)
 x
 table.Drawdowns(thedata,top = 10)
 table.Drawdowns(thedata$V2, top = 10)

 The object 'thedata' has exactly what I expected. the line 'thedata'
 prints the correct contents of the file with each row prepended by a
 line number.  The last few lines are:

 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 The number of lines (8197), dates (and their format) and prices are correct.

 The last four lines produce the following output:
 x = as.timeseries(thedata)
 Error: could not find function "as.timeseries"
 x
 Error: object 'x' not found
 table.Drawdowns(thedata,top = 10)
 Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
 Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


 Are the functions in your example in Rmetrics or PerformanceAnalytics?
 (like I said, I am just beginning this exploration, and I started with
 table.Drawdowns because it produces information that I need first)
 And given that my data is in tab delimited files, and can be read
 using read.csv, how do I feed my data into your four statements?

 My guess is I am missing something in coercing my data in (the data
 frame?) thedata into a timeseries array of the sort the time series
 analysis functions need: and one of the things I find a bit confusing
 is that some of the documentation for this mentions S3 classes and
 some mentions S4 classes (I don't know if that means I have to make
 multiple copies of my data to get the output I need).  I could coerce
 thedata$V2 into a numeric vector, but I'd rather not separate the
 prices from their dates unless that is necessary (how would one
 produce monthly, annual or annualized rates of return if one did
 that?).

 Thanks

 Ted

 On Fri, Jul 3, 2009 at 6:39 PM, Mark Knecht markkne...@gmail.com wrote:
 On Fri, Jul 3, 2009 at 2:48 PM, Ted Byers r.ted.by...@gmail.com wrote:
 I have hundreds of megabytes of price data time series, and perl
 scripts that extract it to tab delimited files (I have C++ programs
 that must analyse this data too, so I get Perl to extract it rather
 than have multiple connections to the DB).

 I can read the data into an R object without any problems.

  thedata = read.csv("K:\\Work\\SignalTest\\BP.csv", sep = "\t", header
  = FALSE, na.strings="")
 thedata

 The above statements give me precisely what I expect.  The last few
 lines of output are:
 8190 2009-06-16 49.30
 8191 2009-06-17 48.40
 8192 2009-06-18 47.72
 8193 2009-06-19 48.83
 8194 2009-06-22 46.85
 8195 2009-06-23 47.11
 8196 2009-06-24 46.97
 8197 2009-06-25 47.43

 I have loaded Rmetrics and PerformanceAnalytics, among other packages.
  I tried as.timeseries, but R2.9.1 tells me there is no such function.
 I tried as.ts(thedata), but that only replaces the date field by the
 row label in 'thedata'.

 If I apply the performance analytics drawdowns function to either
 thedata or thedate$V2, I get errors:
 table.Drawdowns(thedata,top = 10)
 Error in 1 + na.omit(x) : non-numeric argument to binary operator
 table.Drawdowns(thedata$V2, top = 10)
 Error in if (thisSign == priorSign) { :
  missing value where TRUE/FALSE needed


 thedata$V2 by itself does give me the price data from the file.

 I am a relative novice in using R for timeseries, so I wouldn't be
 surprised it I missed something that would be obvious to someone more
 practiced in using R, but I don't see what that could be from the
 documentation of the functions I am looking at using.  I have no
 shortage of data, and I don't want to write C++ code, or perl code, to
 do all the kinds of calculations provided in, Rmetrics and
 performanceanalytics, but getting my data into the functions these
 packages provide is killing me!

 What did I miss?

 Thanks

 Ted

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 

Re: [R] The time series analysis functions/packages don't seem to like my data

2009-07-03 Thread Ted Byers
On Fri, Jul 3, 2009 at 9:05 PM, Mark Knecht markkne...@gmail.com wrote:
 On Fri, Jul 3, 2009 at 5:54 PM, Ted Byers r.ted.by...@gmail.com wrote:
 Sorry, I should have read the read.zoo documentation before replying
 to thank Gabor for his repsonse.

 Here is how it starts:

 read.zoo(zoo) R Documentation

 Reading and Writing zoo Series
 Description
 read.zoo and write.zoo are convenience functions for reading and
 writing zoo series from/to text files. They are convenience
 interfaces to read.table and write.table, respectively.

 Usage
 read.zoo(file, format = , tz = , FUN = NULL,
  regular = FALSE, index.column = 1, aggregate = FALSE, ...)

 Clearly this should solve both our problems.

 Cheers,

 Ted


 Possibly, but I think the big issue is that the findDrawdowns function is
 looking for minus signs to signal the drawdown. I don't think it's
 doing calculations from a simple equity curve.

 All of these functions (findDrawdowns, table.Drawdowns, etc.) all say
 they will accept a data.frame.

 My guess is the issue isn't so much dates, names, or anything else as
 much as making sure you have a column of percentage rise and fall
 numbers expressed like

 0.03
 0.02
 -0.025
 0.10

But this is trivial.  I have to read the documentation further to see
whether it wants rates of return as a fraction (or percentages), or if
daily deltas will do.  Either way, it is trivial to get such numbers
(in my case, in the perl script I use to draw the data from my
database).
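
Either convention is one line from a numeric price vector; a sketch, with p
standing in for the prices:

simple <- diff(p) / head(p, -1)   # fractional rise/fall, e.g. -0.025
logret <- diff(log(p))            # log deltas; nearly the same for small moves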

 Even findDrawdowns(edhec[,5]) does the right thing. Copying it to R
 wasn't necessary. edhec has lots of columns. You can pick any one of
 them and get a table.

This is good to know as it makes some of the analyses I need to do
easier.  I can create a single file with a number of series that need
to be compared WRT drawdowns, VaR, &c.

Cheers,

Ted

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Where can I find information on how to subsample a time series?

2009-06-26 Thread Ted Byers
I suspect I'm looking in the wrong places, so guidance to the relevant
documentation would be as welcome as a little code snippet.

I have time series data stored in a MySQL database.  There is the usual DATE
field, along with a double precision number: there are daily values
(including only normal working days: Monday through Friday).  I actually
have to do a couple things here.  Because of how the result is to be used, I
need to first create two time series.  The first is the delta between 22
working days, and the second is the delta between 66 working days.  I have
hundreds of these datasets, and some go back 30 years.  I need to estimate
the correlation between 22 day deltas (i.e. is the delta for one month
correlated with that of the previous month) and between the 22 day delta and
the 66 day delta that ends the day before the first day of the 22 day
delta.  However, I KNOW the statistical properties of the time series are
not constant (so the usual assumptions do not apply to the entire series).
Therefore, I want to subsample finely enough to get a reasonably sensible
correlation and examine how that changes through time.  (There are no tests
of significance here: I just want to explore just how much the properties of
these series change through time).

I have C++ code, admittedly not written particularly efficiently, that does
this.  The question is, is it possible to do this reasonably efficiently
using R?
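
A sketch of one reasonably efficient route, assuming the series is already a
zoo object z indexed by working days (so diff() and lag() count
observations, not calendar days); the window width and the exact off-by-one
conventions are choices to check against the C++ code:

library(zoo)
d22    <- diff(z, lag = 22)   # 22-working-day deltas
d66    <- diff(z, lag = 66)   # 66-working-day deltas
prev22 <- lag(d22, -22)       # the previous month's 22-day delta
pre66  <- lag(d66, -23)       # 66-day delta ending the day before the 22-day window
# correlation on a sliding window of, say, 250 observations
m    <- merge(d22, prev22, all = FALSE)
roll <- rollapply(m, width = 250, by.column = FALSE,
                  FUN = function(w) cor(w[, 1], w[, 2]))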

Thanks

Ted

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mystery Error in midnightStandard

2009-01-28 Thread Ted Byers
Hi Yohan,  Thanks.

On Wed, Jan 28, 2009 at 4:57 AM, Yohan Chalabi chal...@phys.ethz.ch wrote:

  TB == Ted Byers r.ted.by...@gmail.com
  on Tue, 27 Jan 2009 16:00:27 -0500

   TB I wasn't even aware I was using midnightStandard.  You won't
   TB find it in my
   TB script.
   TB
   TB Here is the relevant loop:
   TB
   TB date1 = timeDate(charvec = Sys.Date(), format = "%Y-%m-%d")
   TB date1
   TB dow = 3;
   TB for (i in 1:length(V4) ) {
   TB x = read.csv(as.character(V4[[i]]), header = FALSE,
   TB na.strings="");
   TB y = x[,1];
   TB year = V2[[i]];
   TB week = V3[[i]];
   TB dtstr = sprintf("%i-%i-%i", year, week, dow);
   TB date2 = timeDate(dtstr, format = "%Y-%U-%w");
   TB resultsdataframe[[i]] <- difftimeDate(date1, date2, units =
   TB "weeks");
   TB fp = fitdistr(y, "exponential");
   TB print(c(V1[[i]], V2[[i]], V3[[i]], fp, fp));
   TB print(c(year, week, date2, resultsdataframe[[i]]));
   TB resultsdataframe[[i]] <- fp;
   TB resultsdataframe[[i]] <- fp;
   TB }
   TB
   TB It fails with a little more than 100 records left in V4.
   TB
   TB The full error message is:
   TB
   TB Error in midnightStandard(charvec, format) :
   TB 'charvec' has non-NA entries of different number of characters

 timeDate() uses the midnight standard. The function 'midnightStandard'
 assumes that all entries in 'charvec' have the same 'format'. Can you
 please check if this is the case?


It is certain that all entries have the same format, but I'm starting to
think that the error message is something of a red herring.  Consider this:

 year = 2009
 week = 0
 day = 3
 datestr = sprintf("%i-%i-%i", year, week, day); datestr
[1] "2009-0-3"
 date1 = timeDate(datestr, format = "%Y-%U-%w");
 date1
GMT
[1] [NA]
 day = 4
 datestr = sprintf("%i-%i-%i", year, week, day); datestr
[1] "2009-0-4"
 date1 = timeDate(datestr, format = "%Y-%U-%w");
 date1
GMT
[1] [2009-01-01]

 datestr = sprintf("%i-%i-%i", year, week, 3); datestr
[1] "2009-0-3"
 date2 = timeDate(datestr, format = "%Y-%U-%w"); date2
GMT
[1] [NA]
 difftimeDate(date2, date1, units = "weeks")
Error in midnightStandard(charvec, format) :
  'charvec' has non-NA entries of different number of characters
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf



The first values for year, week and day are the values on which my loop
dies.  It returns 'NA' here.  It seems clear that it is returning NA because
the date that data corresponds to is 2008-12-31.

The error is being produced by difftimeDate rather than timeDate (as shown
by the above session).  But that represents a flaw in the function design.
It should fail when taking the elapsed time between a null and the present,
but if I wrote such a function, I'd have it return null (perhaps with a
warning) rather than just die.

A bigger issue is that timeDate ought never give null here (which is what I
assume 'NA' means), since all the data comes from transaction data with real
dates, so the elapsed time, measured in weeks, ought to always be a valid
real number that is positive semidefinite.  I have not yet come to any
conclusions as to how it ought to behave (whether to return new years day,
along with a warning, or to return the date requested by reinvoking itself
with the year and week adjusted so a valid date is returned).

On a practical side, how would I test date2 to see if it is null, so I can
give it a sensible default value?
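
Pending that, a workaround sketch: on the assumption (stated later in this
thread) that timeDate() parses the way strptime() does, the string can be
validated first and a default substituted when parsing fails - New Year's
day, per the suggestion above:

dt <- strptime(dtstr, format = "%Y-%U-%w")
if (is.na(dt)) {
  # parsing failed (e.g. week 0, day 3): fall back to New Year's day
  date2 <- timeDate(sprintf("%i-01-01", year), format = "%Y-%m-%d")
} else {
  date2 <- timeDate(dtstr, format = "%Y-%U-%w")
}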

A more troubling thought is that with this handling of dates in this
combination of SQL (my group by clause uses
YEAR(transaction_date),WEEK(transaction_date)) to get the data and R to
process it, the week containing new years day will ALWAYS be split in two at
the first second of the new year. I'm going to have to either figure out a
way to correct this, or ignore it (as it doesn't actually make things wrong,
but rather it splits a sample into two unequal parts).

Thoughts?

Thanks

Ted

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mystery Error in midnightStandard

2009-01-28 Thread Ted Byers
Hi Yohan,

On Wed, Jan 28, 2009 at 10:28 AM, Yohan Chalabi chal...@phys.ethz.ch wrote:

  TB == Ted Byers r.ted.by...@gmail.com
  on Wed, 28 Jan 2009 09:30:58 -0500

   TB It is certain that all entries have the same format, but I'm
   TB starting to
   TB think that the error message is something of a red herring.
   TB Consider this:
   TB
   TB  year = 2009
   TB  week = 0
   TB  day = 3
   TB  datestr = sprintf("%i-%i-%i", year, week, day); datestr
   TB [1] "2009-0-3"
   TB  date1 = timeDate(datestr, format = "%Y-%U-%w");
   TB  date1
   TB GMT
   TB [1] [NA]
   TB  day = 4
   TB  datestr = sprintf("%i-%i-%i", year, week, day); datestr
   TB [1] "2009-0-4"
   TB  date1 = timeDate(datestr, format = "%Y-%U-%w");
   TB  date1
   TB GMT
   TB [1] [2009-01-01]
   TB 
   TB  datestr = sprintf("%i-%i-%i", year, week, 3); datestr
   TB [1] "2009-0-3"
   TB  date2 = timeDate(datestr, format = "%Y-%U-%w"); date2
   TB GMT
   TB [1] [NA]
   TB  difftimeDate(date2, date1, units = "weeks")
TB Error in midnightStandard(charvec, format) :
   TB 'charvec' has non-NA entries of different number of characters
TB In addition: Warning messages:
   TB 1: In min(x) : no non-missing arguments to min; returning Inf
   TB 2: In max(x) : no non-missing arguments to max; returning -Inf
   TB
   TB
   TB
   TB The first values for year, week and day are the values on
   TB which my loop
   TB dies.  It returns 'NA' here.  It seems clear that it is
   TB returning NA because
   TB the date that data corresponds to is 2008-12-31.
   TB
   TB The error is being produced by difftimeDate rather than timeDate
   TB (as shown
   TB by the above session).  But that represents a flaw in the
   TB function design.

 This is not a flaw in timeDate. It behaves the same way as
 'as.POSIXct'.


That the two behave the same doesn't change the assessment that the design
is flawed.  That doesn't mean that the function is wrong.  It means only
that the behaviour can be made more useful.  For example, in SQL, if a given
calculation returns NULL, and the result is subsequently used in another
calculation, the result that returns is also NULL.  That is quite useful,
and admits algorithms that can react appropriately to NULLs when necessary.
That is arguably better than forcing the code to fail the moment a NULL is
used in a secondary calculation.  In C++, OTOH, one can catch the problem
earlier using, e.g., exceptions, again allowing the program to complete even
when problems arise for certain values or combinations thereof.

As a software engineer, I understand the issues involved in creating
libraries.  If I want to incorporate the functionality of a given standard
suite of functions (e.g. ANSI C standard library functions, or posix
functions), my first step would be to ensure I can duplicate how they
behave.  But I would not stop there.  There are, for example, serious design
flaws in many ANSI C functions that, ignored, introduce serious security
defects in applications that use them.  I would therefore refactor them to
eliminate the security defects.  If they can not be eliminated, I would
replace the function in question by a similar function that does not have
that security defect.

Posix is a useful, but old, standard, and I am merely suggesting that once
you have duplicated it, look beyond it to ways it can be improved upon.
There is more to the design of a function than whether or not it gives the
right result with good input.  There is also how it behaves when there is a
problem with the inputs: whether you force the calling code to die the
moment a problem arises, or give it a way to react to such problems.  When
I add functions to my own C++ or Java libraries, I normally include more
bad input data in the unit tests than good data (though the latter is
sufficient to ensure correct results are invariably obtained), precisely so
I can document how the function behaves when there is a problem and give
coders who use it a variety of options for dealing with such problems.



 strptime(datestr, format = "%Y-%U-%w")

 Instead of claiming that there is a flaw in the function you could have
 suggested an 'is.na' method for 'timeDate'.


At the time, I did not know about is.na.  I have spent the past hour trying
is.na, but to no avail.  I guess that is no surprise to you, but that it
would fail is not reflected in the R documentation of is.na, which mentions
S3 but not S4.  As I only recently started using R, I have not yet looked
at what S3 and S4 are, so that is a few more hours of study before I get
this problem solved.



 I will add an 'is.na' method in the dev version of 'timeDate'.


Thanks.  I'll benefit from that once it makes it into the production
release.  In the mean time, I need to find a way to make something similar
now, in my script.
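
Concretely, the stopgap I have in mind (a sketch, untested) is to sidestep
is.na on the timeDate object and validate the string with strptime before
timeDate ever sees it:

weekDate <- function(charvec, format = "%Y-%U-%w") {
  # strptime returns NA for strings that do not parse, so timeDate only
  # ever sees strings known to be good
  if (any(is.na(strptime(charvec, format = format)))) return(NA)
  timeDate(charvec, format = format)
}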

Thanks

Ted


[R] Can I create a timeDate object using only year and week of the year values?

2009-01-27 Thread Ted Byers
For a model I am working on, I have samples organized by year and week of
the year.  For this model, the data (year and week) comes from the basic
sample data, but I require a value representing the amount of time since the
sample was taken (actually, for the purpose of the model, it is sufficient
to use the number of weeks from the middle of the sample week to the
present).

What I have found so far includes:

library(Rmetrics)
time1 = timeDate(charvec = Sys.Date(), format = "%Y-%m-%d", zone = "",
FinCenter = "")
time2 = timeDate("2004-08-30", format = "%Y-%m-%d", zone = "",
FinCenter = "")
difftimeDate(time1, time2, units = "weeks")


Does timeDate use the format strings used by the UNIX date(1) command?  If
so, then can I safely assume timeDate will accept "%Y-%U-%w", and behave
correctly?
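
In other words, what I am hoping will work is something along these lines
(a sketch, untested; it assumes the constructor honours the strptime-style
codes):

year <- 2004; week <- 35; dow <- 3   # Wednesday of week 35 of 2004
mid  <- timeDate(sprintf("%i-%i-%i", year, week, dow), format = "%Y-%U-%w")
difftimeDate(timeDate(Sys.Date(), format = "%Y-%m-%d"), mid, units = "weeks")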

Thanks,

Ted



Re: [R] Can I create a timeDate object using only year and week of the year values?

2009-01-27 Thread Ted Byers
Thanks Patrick.

On Tue, Jan 27, 2009 at 2:03 PM, Patrick Connolly 
p_conno...@slingshot.co.nz wrote:

 On Tue, 27-Jan-2009 at 11:36AM -0500, Ted Byers wrote:


 []


 | Does timeDate use the format strings used by the UNIX date(1)
 | command?  If so, then can I safely assume timeDate will accept
 | "%Y-%U-%w", and behave correctly?

 Your chances are good.  To be sure, check out

 ?strptime

 HTH



According to ?strptime, the answer is yes; something I have confirmed with
limited trials.


 --
 ~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
___Patrick Connolly
  {~._.~}   Great minds discuss ideas
  _( Y )_ Average minds discuss events
 (:_~*~_:)  Small minds discuss people
  (_)-(_)  . Eleanor Roosevelt

 ~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.


Smart lady!  Too bad there are no great minds in power in these economically
interesting times.

Thanks

Ted



[R] Mystery Error in midnightStandard

2009-01-27 Thread Ted Byers
I wasn't even aware I was using midnightStandard.  You won't find it in my
script.

Here is the relevant loop:

date1 = timeDate(charvec = Sys.Date(), format = "%Y-%m-%d")
date1
dow = 3;
for (i in 1:length(V4) ) {
  x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings="");
  y = x[,1];
  year = V2[[i]];
  week = V3[[i]];
  dtstr = sprintf("%i-%i-%i", year, week, dow);
  date2 = timeDate(dtstr, format = "%Y-%U-%w");
  resultsdataframe$dt[[i]] <- difftimeDate(date1, date2, units = "weeks");
  fp = fitdistr(y, "exponential");
  print(c(V1[[i]], V2[[i]], V3[[i]], fp$estimate, fp$sd));
  print(c(year, week, date2, resultsdataframe$dt[[i]]));
  resultsdataframe$estimate[[i]] <- fp$estimate;
  resultsdataframe$sd[[i]] <- fp$sd;
}

It fails with a little more than 100 records left in V4.

The full error message is:

Error in midnightStandard(charvec, format) :
  'charvec' has non-NA entries of different number of characters

Until it fails, date2 and resultsdataframe$dt[[i]] get correct values.

str() produces no surprises:

> str(resultsdataframe)
'data.frame':   303 obs. of  6 variables:
 $ mid : int  171 206 206 206 206 206 206 206 206 218 ...
 $ year: int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ week: int  16 17 18 19 21 26 31 35 51 40 ...
 $ dt  : num  39.9 38.9 37.9 36.9 34.9 ...
 $ estimate: num  Inf 0.25 Inf 0.0408 0.2 ...
 $ sd  : num  Inf 0.1768 Inf 0.0289 0.1414 ...

I would assume the error is related to my new code that manipulates dates,
as it doesn't occur in the earlier version that did not manipulate dates
(the relevant work being done, albeit very slowly, within the DB).

FTR: The year and week values are generated by MySQL using the YEAR and WEEK
functions applied to timestamps.  I do not know if it is relevant, but the
week value, at the point of failure, is 0 (a value that does not occur
earlier in the dataset, but several times subsequently), and I do not see
how a value of 0 for the week (legitimate in POSIX date formats) could
produce the error message I get.
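
For what it's worth, the best stopgap I have come up with (a sketch,
untested) is to validate dtstr with strptime before calling timeDate, and
record NA when it fails to parse:

dt = strptime(dtstr, format = "%Y-%U-%w");
if (is.na(dt)) {
  resultsdataframe$dt[[i]] <- NA;   # e.g. week 0: flag it and move on
} else {
  date2 = timeDate(dtstr, format = "%Y-%U-%w");
  resultsdataframe$dt[[i]] <- difftimeDate(date1, date2, units = "weeks");
}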

Any thoughts on what is really wrong, and how to fix it?

Thanks

Ted



Re: [R] Staging area for data before read into R

2008-10-21 Thread Ted Byers

There are tradeoffs no matter what route you take.  

I worked on a project a few years ago, repairing an MS Access DB that had
been constructed, data entry forms and all, by one of the consulting
engineers.  They supported that development because they found that even
with all the power and utility of Excel, in support of data entry, errors
were still much too common, requiring significant time and therefore money
to address.  The more data you have, the more costly it is to find and
repair the data when something goes awry.  Their problem was that it was put
together in haste by a man who knew nothing about RDBMS.  He was learning as
he went.  While what he produced was adequate for a single site with no
turnover in staff, and would have been fine if it was intended for his own
use in his consulting practice, it would inevitably have broken the moment
the service was extended to more than one site or the moment there was any
turnover in staff at all.  The client he delivered it to was a large mining
company that wanted to deploy it on all their mines and ore processing
facilities.  Yes the error rate in data entry went way down, but the mistake
was in trying to deliver a software product to a client without the input of
an experienced software engineer.

You can do validation in Access as you can in Excel, but Excel is not
designed to manage data where Access is, and both are crippled by their
dependence on VB (a seriously broken language: fine for scripting MS
Office, but not what you want to develop a real application - not that the
OP wants that anyway).  I don't want to beat on Excel, as it is a useful
tool when used for what it is designed for; and others have pointed out some
hazards when using it.  Dr. Snow is right in recommending going the route of
using an RDBMS and in saying that it isn't that hard to get started.  I'd be
recommending PostgreSQL, though, since it is relatively easy to use, and it
has pl/r (which lets you run R code within stored procedures in the DB)
which carries obvious advantages.

The bottom line is that the best option depends on your objectives and what
you need to do.  You can use Excel quite effectively if you are careful and
know what you're doing.  If you are going to manage data that will require
significant effort to enter, you may want something a bit more robust and
better designed to manage your data.  If you are going to deliver services
based on your software, you need a software engineer to ensure it doesn't
break on your client (that could be quite costly).  Since the OP is
apparently using it for his own purposes, and unlikely to be selling the
data or services based on it, the services of an engineer aren't needed,
though they can be useful if there are concerns about administering the DB
so as to guarantee the security of the data.  I could tell you tales of how
data that cost millions of dollars to collect were almost lost because a
consultant was careless in this regard and made mistakes in handling the
data.  Fortunately, recovery was quick in these cases because my colleagues
were diligent in maintaining backups.  But you get the point.  

Murphy's law says that whatever can go wrong, will.

There are plenty of options, and the OP will need to do what he's most
confortable doing.

If I were in his place, I'd say my data is sacred and cannot be replaced
(just as you can't step into the same stream twice); and therefore I'd use
an RDBMS to manage it, and the very moment it is all entered, I'd make a backup
of both the data (e.g. in MySQL I'd use mysqldump) AND the software, and
copy both backups to two CDs or DVDs.  And, if the data were originally
recorded on paper, I'd be scanning the pages and copying those images onto a
couple CDs or DVDs also: with two copies on optical media, one copy can be
stored in a fireproof vault while the other is in the office ready to be
used should a HDD fail, or some other disaster interrupt my work.  OK, so
I'm paranoid about my data, but I'd rather go the extra mile than risk
losing it.

Cheers,

Ted


Gabor Grothendieck wrote:
 
 Excel has a data validation facility and also has data input forms to
 facilitate data entry.
 
 On Tue, Oct 21, 2008 at 1:45 PM, Greg Snow [EMAIL PROTECTED] wrote:
 Stephen,

 One of the big problems with spreadsheets (other than the column limit in
 some) is that the standard entry mode allows too much flexibility which
 does nothing to help you avoid data entry errors.  The Webpage:
 http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html has some
 examples of this going wrong, including one that happened to my group
 where the column for dates was not preformatted, the dates were entered
 using European format, and Excel did 2 different wrong things with them
 making it very difficult to do anything with the data without major extra
 work.  If you are going to stick with a spreadsheet, then at a minimum
 you should start by naming all your columns, then formatting each 

Re: [R] Staging area for data before read into R

2008-10-21 Thread Ted Byers

I wasn't suggesting that the validation requires VB.  

Creating forms and handling form events does (unless MS has introduced new
utilities to hide all that since last I used it).

Some of the most interesting things I have seen done with Excel did involve
VB, and there are better tools to do most of those things.

Gabor Grothendieck wrote:
 
 On Tue, Oct 21, 2008 at 3:18 PM, Ted Byers [EMAIL PROTECTED] wrote:
 There are tradeoffs no matter what route you take.
 You can do validation in Access as you can in Excel, but Excel is not
 designed to manage data where Access is, and both are crippled by their
 dependence on VB (a seriously broken language: fine for scripting MS
 
 Excel can do validation without VB.  For example, you can restrict
 data to a certain range of dates, limit choices by using a list, or
 make sure that only positive whole numbers are entered all without
 any VB.
 
 
 



Re: [R] Staging area for data before read into R

2008-10-21 Thread Ted Byers

No.  Excel, like most spreadsheets, does what it is designed for reasonably
well.  It is easy to find fault, but not so easy to satisfy all one's
critics.  There is no doubt that Excel has faults, but it provides
significant modelling and analysis capability to users with no programming
expertise or limited experience using IT.

I have used it as a teaching tool for very basic modelling with
undergraduate students who would not have been able to do any modelling
without it.  In a one-session course, there just isn't time to teach
students enough programming in any language for them to have a hope of
producing an interesting model.  But they can produce an interesting model
with some guidance using Excel.  Similarly, they can do elementary data
analysis by entering their data into Excel and analysing it there.

Excel was designed primarily for business people, and I have seen them use
it effectively, doing things I don't fully understand (as I am not a
businessman).  But these same people would go into a catatonic state the
moment a discussion becomes technical or mathematical.  They describe Excel
as powerful, and until I become an expert MBA type, I won't knock them for
that.  If they find it useful, why would I argue with them?

Don't get me wrong, I do not normally use it, and for 99% of the work I do,
it provides no value to me, so I do not have it installed on my own systems. 
I am better served by C++, Java, and the related tools specific to my work. 
But that it isn't useful to me, or apparently you, is not sufficient grounds
to question its utility for others (neither is the existence of bugs, as ALL
software has bugs: MS makes for an easy target, but I try to be as fair to
them as I am to an independent developer who works alone - let's not have
this degenerate into an attack on MS, please).  As a software engineer
myself, I won't knock the work of another just because what he's produced
isn't particularly useful for me.  I won't even knock him if I don't agree
with the design decisions he's made.  When that happens, it is likely I was
not part of his intended market: nothing more can be implied.


Rolf Turner-3 wrote:
 
 
 On 22/10/2008, at 8:18 AM, Ted Byers wrote:
 
   snip
 
 ... even with all the power and utility of Excel ...
 
   snip
 
 Is this some kind of joke?
 
   cheers,
 
   Rolf Turner
 
 
 
 



Re: [R] Staging area for data before read into R

2008-10-21 Thread Ted Byers
Ah, OK.  That is new since I used Excel last.

Thanks

On Tue, Oct 21, 2008 at 5:52 PM, Gabor Grothendieck
[EMAIL PROTECTED] wrote:
 You can create data entry forms without VB in Excel too.

  []





[R] How to get estimate of confidence interval?

2008-10-20 Thread Ted Byers

I thought I was finished, having gotten everything to work as intended.  This
is a model of risk, and the short term forecasts look very good, given the
data collected after the estimates are produced (this model is intended to
be executed daily, to give a continuing picture of our risk).  But now there
is a new requirement.

I have weekly samples from a non-autonomous process (i.e. although well
modelled as a decay process, with an exponential distribution fitting the
decay times well, the rate estimates and their sd vary considerably from one
week to the next).  The total number of events to be expected from a given
sample over the next week can be easily estimated from a simple integral. 
And the total number of these events from all samples, is just the sum of
these estimates over all samples.  So far, so good (imagine you have a
sample of a variety of species of radionuclides all emitting alpha particles
with the same energy - so you can't tell from the decay event which species
produced the alpha particles).

I guess there are two parts to my question.  I get a fit of the exponential
distribution to each sample using fitdistr(x, "exponential").  I am finding
the expected values vary by as much as a factor of 4, and the corresponding
estimates of sd vary by as much as a factor of 100 (some samples are MUCH
larger than others).  How do I go from the sd it gives to a 99% confidence
interval for the integral for that function from now through a week from now
(or to the end of time, or through the next month/quarter)?  And how do I
move from these estimates to get the expected value and confidence intervals
for the totals over all the samples?  I am a bit rusty on figuring out how
error propagates through model calculations (an online reference for this
would be handy, if you know of one).
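
In case it helps to be concrete, the only approach I have thought of so far
is the delta method, along these lines (a sketch; I am guessing the weekly
integral is just the exponential CDF over the next 7 days, with fp being a
fitdistr result):

lam <- fp$estimate                          # fitted rate
s   <- fp$sd                                # its standard error
g   <- function(lam, T) 1 - exp(-lam * T)   # expected fraction of events by T
gp  <- function(lam, T) T * exp(-lam * T)   # derivative of g w.r.t. lam
ci  <- g(lam, 7) + c(-1, 1) * qnorm(0.995) * gp(lam, 7) * s   # approx. 99% CI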

Thanks

Ted


Re: [R] Staging area for data before read into R

2008-10-20 Thread Ted Byers

Define "better".

Really, it depends on what you need to do (are all your data appropriately
represented in a 2D array?) and what resources are available.  If all your
data can be represented using a 2D array, then Excel is probably your best
bet for the near term.  If not, you might as well bite the bullet and learn
to use an RDBMS, as there are few other data management options that can
cope with relational or hierarchical or object oriented data.

I use a number of different RDBMS (ranging from MS SQL to PostgreSQL and
MySQL).  I also use Excel on occasion, and plain text editors (like Emacs),
to create CSV files.  Which I use depends on the details of the particular
problem I am facing.

While I have not yet explored them, I did notice that R includes a number of
facilities for editing data (and the list of options is even longer when
I use help.search("edit")).

It may be a bit quicker for you to study up on basic use of something like
PostgreSQL, combined with pl/r (something I wish MySQL had), than it would
be to diligently examine all the different options open to you using R.  (I
have a couple books I could recommend that would likely be sufficient for
you to figure out what you need to do with either PostgreSQL or MySQL in a
matter of a week or two).

HTH

Ted


stephen sefick wrote:
 
 I am wondering if there is a better alternative than Excel for data
 storage that does not require database knowledge (I will eventually
 have to learn this, but it is not on my immediate todo list).  I need
 something that is not limited to 256 columns... I don't need any of
 the built in functions in excel just a spreadsheet like program with
 cells that hold data in a data.frame format for a staging area before
 I get it into R.  Any help would be greatly appreciated.  This is not
 a direct r question, but all of you folks have more experience than I
 do and I am having a time finding what I need with google.
 thanks in advance
 
 -- 
 Stephen Sefick
 Research Scientist
 Southeastern Natural Sciences Academy
 
 Let's not spend our time and resources thinking about things that are
 so little or so large that all they really do for us is puff us up and
 make us feel like gods.  We are mammals, and have not exhausted the
 annoying little problems of being mammals.
 
   -K. Mullis
 
 
 



[R] dbApply questions/clarifications

2008-10-15 Thread Ted Byers

In the example in the documentation, I see:

rs <- dbSendQuery(con,
  "select Agent, ip_addr, DATA from pseudo_data order by Agent")
out <- dbApply(rs, INDEX = "Agent",
  FUN = function(x, grp) quantile(x$DATA, names = FALSE))

Maybe I am a bit thick, but it took me a while, and a kind hint from Phil,
to figure much of this out.

It is clear that the SQL orders the data by Agent, and the INDEX parameter
tells dbApply that FUN is to be applied to each group of values defined by
Agent (like applying SUM(DATA) in SQL using a GROUP BY clause).  If my
understanding is correct, out will be an array holding ordered pairs, with
the value of Agent and the corresponding values returned by FUN.

I take it FUN = function(x, grp) quantile(x$DATA, names = FALSE) is the
function definition for a function called FUN.  I would guess, then, that
the opening and closing braces are optional.  Is that correct?  Or is this
something else?  I did not see a definition of 'grp'.  What is it?

Suppose the function I want to apply is fitdistr(x, "exponential").  Would
I just replace quantile(x$DATA, names = FALSE) with
fitdistr(x, "exponential")?
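
That is, presumably something along these lines (untested):

out <- dbApply(rs, INDEX = "Agent",
               FUN = function(x, grp) fitdistr(x$DATA, "exponential")$estimate)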

Finally, suppose the query I need to run is more complex, such as:

SELECT group_id,YEAR(my_date),WEEK(my_date),ndays FROM myTable ORDER BY
group_id,YEAR(my_date),WEEK(my_date);

Can dbApply handle applying fitdistr(x, "exponential") to each group of
values defined by group_id, YEAR(my_date), WEEK(my_date)?  If so, how would
I change the call to dbSendQuery, and how would I insert the resulting
estimates using something like "INSERT INTO myResults
(group_id, year, week, rate, sd) VALUES (?,?,?,?,?);"?

Once I get this, I can do everything else within a stored procedure in
MySQL.  I get the idea of using, e.g., sprintf to interpolate values I need
to insert into a query string, but it is a question of how to get the values
I need from 'out' (to use the above example), and how to iterate over them
to do the SQL INSERT.

Actually, would 'dbWriteTable' handle inserting these values efficiently?
If so, how do I ensure it maps the group_id, year, week, etc. from 'out' to
the right columns in my results table (what I have in mind involves a table
with a couple extra columns that would take appropriate default values)?

Thanks

Ted


[R] Can R scripts executed in batch mode take a commandline argument?

2008-10-15 Thread Ted Byers

I have examined the documentation for batch mode use of R:

R CMD BATCH [options] infile [outfile]

The documentation for this seems rather spartan.  

Running R CMD BATCH --help gives me info on only two options: one for
getting help and the other to get the version.  I see, further on, that
there are options for restoring and saving sessions (which I do not need to
do in this case), but are there other options defined?  If so, what are they
and how are they to be used?

However, it goes on to say: "Further arguments starting with a '-' are
considered as options as long as '--' was not encountered, and are passed on
to the R process, which by default is started with '--restore --save'."

I see here it says further arguments starting with a '-' are passed to the R
process, but usage is not clear.  For example, if I write a script that
should take as a commandline argument the name of a file that contains a
series of numbers that I want to place into a vector which, in turn, I want
to pass to fitdistr(x, "exponential"), what do I do to get that file name
from the commandline and pass it to, say, read.csv?

BTW: How would I tell it that there is no need to restore and save?

If I can't pass a commandline argument, do I have to write the arguments in
a file, and have that file read each time I need to run the script?
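
For concreteness, what I am hoping for is something like this (a sketch
built on commandArgs, which I gather collects everything after '--args';
untested):

# launched as: R CMD BATCH --no-save --no-restore "--args mydata.csv" myscript.R
args  <- commandArgs(trailingOnly = TRUE)   # everything after --args
fname <- args[1]
x     <- read.csv(fname, header = FALSE)
library(MASS)
print(fitdistr(x[, 1], "exponential"))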

Thanks

Ted


[R] Argh! Trouble using string data read from a file

2008-10-15 Thread Ted Byers

Here is what I tried:

optdata =
read.csv("K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat",
header = FALSE, na.strings="")
optdata
attach(optdata)
for (i in 1:length(V4) ) { x = read.csv(V4[[i]], header = FALSE,
na.strings="");x }

And here is the outcome (just a few of the 60 records successfully read):

> optdata =
> read.csv("K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat",
> header = FALSE, na.strings="")
> optdata
    V1   V2 V3                        V4
1  251 2008 18 Plus_Shipping.2008.18.dat
2  251 2008 19 Plus_Shipping.2008.19.dat
3  251 2008 20 Plus_Shipping.2008.20.dat
4  251 2008 22 Plus_Shipping.2008.22.dat
5  251 2008 23 Plus_Shipping.2008.23.dat
6  251 2008 24 Plus_Shipping.2008.24.dat
6  251 2008 24 Plus_Shipping.2008.24.dat

I can see the data has been correctly read.  But for some reason that isn't
clear, read.csv doesn't like the data in the last column.

> attach(optdata)
> for (i in 1:length(V4) ) { x = read.csv(V4[[i]], header = FALSE,
+ na.strings="");x }
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  'file' must be a character string or connection
> V4[[1]]
[1] Plus_Shipping.2008.18.dat
60 Levels: Easyway.2008.17.dat Easyway.2008.18.dat Easyway.2008.19.dat
Easyway.2008.20.dat ... Secured_Pay.2008.31.dat



The last column is comprised of valid Windows filenames (and no whitespace,
so as not to confuse things).

I see in the documentation "`[[...]]' is the operator used to select a
single element, whereas `[...]' is a general subscripting operator.", so I
assume V4[[i]] is the correct way to get the ith value from V4.  So why does
read.csv complain that 'file' must be a character string or connection?
It seems obvious that the value in V4[[i]] is a string.  V4[[1]] does give
me the right value, although that is followed by output I didn't ask for.

In the loop above, I was going to replace the output obtained by 'x' with
output from fitdistr(x, "exponential"), but I can't proceed with that until
I can get the data in these files read.

What have I missed?

Thanks

Ted


Re: [R] Argh! Trouble using string data read from a file

2008-10-15 Thread Ted Byers
Actually, I'd tried single brackets first.  Here is what I got:

> for (i in 1:length(V4) ) { x = read.csv(V4[i], header = FALSE,
+ na.strings="");x }
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  'file' must be a character string or connection



The advice to use as.character worked, in that progress has been made.
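
For the archive: the root of the problem seems to be that read.csv turned
V4 into a factor, so presumably telling it not to would avoid the
conversion altogether, e.g.:

optdata = read.csv("K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat",
                   header = FALSE, na.strings="", stringsAsFactors = FALSE)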

Can you guys explain the following output, though?

> setwd("K:\\MerchantData\\RiskModel\\AutomatedRiskModel")
> for (i in 1:length(V4) ) { x = read.csv(as.character(V4[[i]]), header =
+ FALSE, na.strings="");x }
> x
  V1
1  0
> x = read.csv(as.character(V4[[1]]), header = FALSE, na.strings="");x
   V1
1   0
2   0
3  21
4   0
5   1
6   7
7  51
8  20
9   3
10  5
11  6
12  8
13  2
14  0
15  2
16  4
17 23

Clearly, if I hand write a line to read the data, getting the file
name from V4 (in this case V4[[1]]), I get the data into 'x', which I
can then display.  I only displayed the first few as some of these
files will have thousands of values.

But what puzzles me is that I saw virtually no output from my loop.  I
thought what would happen (with the x after the ';') is that the
contents of each file would be displayed after it is read and before
the next is read.  And after the loop finishes, there is nothing in
x.  I don't see why the contents of x would disappear after the loop,
unless R has scoping restrictions as stringent as, say, C++ (e.g. a
variable declared inside a loop is not visible outside the loop).  But
that would beg the question as to how to declare a variable before it
is first used.

This doesn't bode well for me, or perhaps my ability to learn a new
trick at my age, when such a simple loop should give me such trouble.
:-(

Getting more grey hair by the minute.  :-(

Thanks

ted

On Wed, Oct 15, 2008 at 5:12 PM, Rolf Turner [EMAIL PROTECTED] wrote:

 On 16/10/2008, at 10:03 AM, jim holtman wrote:

 try putting as.character in the call:

 x = read.csv(as.character(V4[[i]]), header = FALSE

 No.  This won't help.  V4 is a column of the data frame optdata,
 and hence is a vector.  Not a list!  Use single brackets --- V4[i] ---
 and all will be well.

cheers,

Rolf

  []

Re: [R] Argh! Trouble using string data read from a file

2008-10-15 Thread Ted Byers
Thanks Jim,

I hadn't seen the distinction between the commandline in RGui and what
happens within my code.

I have, however, seen other differences I don't understand.  For
example, looking at the documentation for Rscript, I see:

Rscript [options] [-e expression] file [args]

And the example:

Rscript -e 'date()' -e 'format(Sys.time(), "%a %b %d %X %Y")'


So I tried it (Windows XP; R 2.7.2), and this is what I got, just copying
directly from the documentation and pasting into the Windows commandline
window:

C:\> Rscript -e 'date()' -e 'format(Sys.time(), "%a %b %d %X %Y")'
[1] "date()"

C:\> Rscript -e 'format(Sys.time(), "%a %b %d %X %Y")'

C:\>

But within RGui, I get:

> date(); format(Sys.time(), "%a %b %d %X %Y")
[1] "Wed Oct 15 20:36:57 2008"
[1] "Wed Oct 15 8:36:57 PM 2008"


Thanks again

Ted

On Wed, Oct 15, 2008 at 8:09 PM, jim holtman [EMAIL PROTECTED] wrote:
 You have to explicitly 'print' the value of x in the loop:  print(x)

 'x' by itself is just its value.  At the command line, typing an
 object's name is equivalent to printing that object, but that only
 happens at the command line.  If you want a value printed, then 'print'
 it.  That also works at the command line, if you want to use it there.
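
 A minimal sketch of the fix in the loop, then:

 for (i in 1:length(V4)) {
   x <- read.csv(as.character(V4[[i]]), header = FALSE, na.strings="")
   print(x)   # explicit print: auto-printing only happens at top level
 }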

  []

[R] Two last questions: about output

2008-10-15 Thread Ted Byers

Here is my little scriptlet:

optdata =
read.csv("K:\\MerchantData\\RiskModel\\AutomatedRiskModel\\soptions.dat",
header = FALSE, na.strings="")
attach(optdata)
library(MASS)
setwd("K:\\MerchantData\\RiskModel\\AutomatedRiskModel")
for (i in 1:length(V4) ) {
   x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings="");
   y = x[,1];
   fp = fitdistr(y, "exponential");
   print(c(V1[[i]], V2[[i]], V3[[i]], fp$estimate, fp$sd))
}


And here are the first few lines of output:

                                               rate         rate
2.510000e+02 2.008000e+03 1.800000e+01 6.869301e-02 6.462095e-03
                                               rate         rate
2.510000e+02 2.008000e+03 1.900000e+01 5.958023e-02 4.491029e-03
                                               rate         rate
2.510000e+02 2.008000e+03 2.000000e+01 8.631714e-02 7.428996e-03
                                               rate         rate
2.510000e+02 2.008000e+03 2.200000e+01 1.261538e-01 1.137491e-02
                                               rate         rate
2.510000e+02 2.008000e+03 2.300000e+01 1.339523e-01 1.332875e-02
                                               rate         rate
2.510000e+02 2.008000e+03 2.400000e+01 8.916084e-02 1.248501e-02

There are only two things wrong, here.

1) the first three columns are integers, and are output variously as
integers, floating point numbers and, as shown here, in scientific notation.
2) this output isn't going to a file or to my DB.  This second issue isn't
much of a problem, as I think I know now how to deal with it.

This output data is, in one sense, perfectly organized, and there is a table
with a nearly identical structure (these five columns, plus one to hold the
date on which the analysis is performed, which of course has a default value
of the current timestamp - handled in MySQL).  If I can get the data written
to a CSV file, with the first three columns provided as integers, I can use
the DB's bulk load utility to get the data into the DB, and this may be
faster than having this scriptlet connect directly to the DB to insert the
data (unless the DBI has a function for a bulk load that helps here).

Any idea how best to handle my formatting problem here?
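
What I am picturing (a sketch; the column names are my own) is collecting
the results in a data frame, where each column keeps its own type, and
writing that out for the bulk loader:

res <- data.frame(mid = V1, year = V2, week = V3,
                  rate = numeric(length(V4)), sd = numeric(length(V4)))
for (i in 1:length(V4)) {
  y  <- read.csv(as.character(V4[[i]]), header = FALSE, na.strings="")[, 1]
  fp <- fitdistr(y, "exponential")
  res$rate[i] <- fp$estimate
  res$sd[i]   <- fp$sd
}
write.csv(res, "rates.csv", row.names = FALSE)   # mid/year/week stay integers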

Thanks

Ted


Re: [R] Two last questions: about output

2008-10-15 Thread Ted Byers
Thanks Gabor,

I get how to make a frame using existing vectors.  In my example, the
following puts my first three columns into a frame (and displays it):

> testframe <- data.frame(mid = V1, year = V2, week = V3)
> testframe
   mid year week
1  251 2008   18
2  251 2008   19
3  251 2008   20
4  251 2008   22
5  251 2008   23
6  251 2008   24
7  251 2008   25

I show the first of about 60 rows, and I am pleased that these values
appear as integers.

But what I don't see is how to add the fp$estimate,fp$sd values
obtained from my analyses to vectors to form the last two columns in
the data frame.  Is there something like a vector type, analogous to
the vector class std::vector from C++, that has a push_back function
allowing a vector to grow as new values are generated?
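
The closest thing I have found so far is that plain R vectors seem to grow
on assignment, e.g.:

est <- numeric(0)
est <- c(est, 0.5)            # append, much like push_back
est[length(est) + 1] <- 0.7   # assigning past the end also grows the vector

but I don't know whether that is the idiomatic way to accumulate results.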

And suppose I have the following table in MySQL (ignoring for the
moment keys and indices):

CREATE TABLE myResults (
  id INTEGER UNSIGNED NOT NULL auto_increment,
  mid INTEGER NOT NULL,
  y INTEGER NOT NULL,
  w INTEGER NOT NULL,
  rate DOUBLE NOT NULL,
  sd DOUBLE NOT NULL,
  process_date DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB;

How would I tell dbWriteTable() that my frame's five columns
correspond to mid,y,w,rate and sd in that order, and that the fields
id and process_date will take the appropriate default values?  Or do I
need a temporary table, in memory, that has only the five columns, and
use a stored procedure to move the data to its final home?

Thanks again,

Ted


On Wed, Oct 15, 2008 at 9:57 PM, Gabor Grothendieck
[EMAIL PROTECTED] wrote:
 Put the data in an R data frame and use dbWriteTable() to
 write it to your MySQL database directly.

  []



Re: [R] Argh! Trouble using string data read from a file

2008-10-15 Thread Ted Byers
Thank you Prof. Ripley.

I appreciate this.

Have a good day.

Ted

On Thu, Oct 16, 2008 at 12:20 AM, Prof Brian Ripley
[EMAIL PROTECTED] wrote:
 On Wed, 15 Oct 2008, Ted Byers wrote:

  []

 Your problem is the shell quoting: the Windows shell requires ".  E.g.

 C:\> d:/R/R-2.7.2/bin/Rscript -e "date()" -e "format(Sys.time(), \"%a %b %d
 %X %Y\")"
 [1] "Thu Oct 16 05:16:46 2008"
 [1] "Thu Oct 16 05:16:46 2008"

 Other shells (e.g. bash, tcsh) do allow '...', and indeed that is the
 preferred form there.  See ?shQuote .


  []

Re: [R] Two last questions: about output

2008-10-15 Thread Ted Byers
Thanks Gabor,

To be clear, would something like testframe$est[[i]] <- fp$estimate be
valid within my loop, as in (assuming I created testframe before the
loop):

for (i in 1:length(V4) ) {
   x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings="");
   y = x[,1];
   fp = fitdistr(y, "exponential");
   print(c(V1[[i]], V2[[i]], V3[[i]], fp$estimate, fp$sd))
   testframe$est[[i]] <- fp$estimate
   testframe$sd[[i]] <- fp$sd
}

Thanks

Ted

On Thu, Oct 16, 2008 at 12:08 AM, Gabor Grothendieck
[EMAIL PROTECTED] wrote:
 testframe$newvar <- ...whatever...
 (or see ?transform for another way)
 adds a new column to the data frame.  The table does not
 have to pre-exist in your MySQL database and you don't need
 a create statement; however, if the table does pre-exist, the columns
 of your data frame and those of the database table should have the
 same names in the same order, and use dbWriteTable(..., append = TRUE)

  []

[R] Getting frustrated with RMySQL

2008-10-14 Thread Ted Byers

Getting the basic stuff to work is trivially simple.  I can connect, and, for
example, get everything in any given table.

What I have yet to find is how to deal with parameterized queries or how to
do a simple insert (but not of a value known at the time the script is
written - I ultimately want to put my script into a scheduled task, so the
analysis can be repeated on updated data either daily or weekly).  

Using "INSERT INTO myTable (a) VALUES (1)" is simple enough, but what if I
want to insert a sample number (using, e.g., WEEK(sample_date) as a sample
identifier) along with the rate parameter estimated using fitdistr to fit an
exponential distribution to a dataset, along with its sd?  If I were using
Perl or Java, I'd set up the query as something like "INSERT INTO myTable
(a,b,c) VALUES (?,?,?)", and then use function calls to set each of the
query parameters.  I am having an awful time finding the corresponding
functions in RMySQL.

And for the data, the simplest, and most efficient, way to get the data is
to use a statement like:

SELECT a,b,c FROM myTable GROUP BY g_id, WEEK(sdate);

The data is in MySQL, and my analysis needs to be applied independently to
each group obtained from a query like this.  It appears I can't use a data
frame since none of the samples are of the same size (let's say the
probability of the samples being the same size is indistinguishable from 0). 
Is it possible to put the resultset from such a query into a list of vectors
that I can iterate over, passing each vector to fitdistr in turn?  If so,
how?
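To illustrate the sort of thing I am after (a sketch; connection details
are made up): drop the GROUP BY, fetch the raw rows, and let split() build
exactly that list of unequal-length vectors:

library(RMySQL)
library(MASS)
con <- dbConnect(MySQL(), dbname = "mydb")   # illustrative
df  <- dbGetQuery(con, "SELECT WEEK(sdate) AS wk, a FROM myTable")
samples <- split(df$a, df$wk)                # one vector per week, unequal lengths
fits    <- lapply(samples, fitdistr, densfun = "exponential")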

I know I can get this using Perl (by getting each sample individually and
writing it to a file, then having R read the file, do the analysis and write
the output to another file, and then have Perl parse the output file to
insert the parameter estimates I need into the appropriate table), but that
seems inefficient.

Is it possible to do all I need with R working directly with MySQL?  If so,
can someone fill in the apparent gaps left in the RMySQL documentation?

Thanks.

Ted
-- 
View this message in context: 
http://www.nabble.com/Getting-frustrated-with-RMySQL-tp19980592p19980592.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Getting frustrated with RMySQL

2008-10-14 Thread Ted Byers

Thanks Jeffrey and Barry,

I like the humour.  I didn't know about xkcd.com, but the humour on it is
familiar.  I saw little Bobby Tables what seems like eons ago, when I first
started cgi programming.

Anyway, I recognized the risk of an injection attack with this use of
sprintf, but in this case there is no risk because all the data used is
coming from previously sanitized data in our DB, and the parameters in this
case will invariably be integers.

Thanks again

Ted



Jeffrey Horner wrote:
 
 Barry Rowlingson wrote on 10/14/2008 04:40 PM:
 2008/10/14 Jeffrey Horner [EMAIL PROTECTED]:
 
 I've found the best way to parameterize is using R's sprintf function.
 For
 instance, the following query not only parameterizes the variable
 position,
 but also the table name:

  fields <- dbGetQuery(con, sprintf("select field,elem_label from %s_meta
  where field='%s'", inp$pnid, inp$field))

 
  And thus a million web SQL injection exploits were born...
 
  Even if you do have control over the parameters to the query, you
 still have to worry about quotes or other nasty escape characters in
 your string ending up in the SQL. I hope little Bobby Tables isn't a
 subject in your analysis:
 
 Thank goodness I don't do analysis, as I haven't the schooling. Barry, 
 I'm ashamed of you! I was hoping you'd at least offer an alternative.
 
 http://xkcd.com/327/
 
 Okay, you are pardoned: I LOVE xkcd! Especially this one:
 
 http://xkcd.com/349/
 
 Best,
 
 Jeff
 -- 
 http://biostat.mc.vanderbilt.edu/JeffreyHorner
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/Getting-frustrated-with-RMySQL-tp19980592p19983073.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Getting frustrated with RMySQL

2008-10-14 Thread Ted Byers

That is neat Gabor.  Thanks, Ted

Gabor Grothendieck wrote:
 
 The gsubfn package can do quasi perl-style interpolation by
 prefacing any function call with fn$.
 
 library(gsubfn)
  x <- 3
  fn$dbGetQuery(con, "select * from myTable where myColumnA = $x and
  MyColumnB = `2*x`")
 
 See http://gsubfn.googlecode.com
 
 
 On Tue, Oct 14, 2008 at 5:32 PM, Jeffrey Horner
 [EMAIL PROTECTED] wrote:
 Ted Byers wrote on 10/14/2008 02:33 PM:

 Getting the basic stuff to work is trivially simple.  I can connect,
 and,
 for
 example, get everything in any given table.

 What I have yet to find is how to deal with parameterized queries or how
 to
 do a simple insert (but not of a value known at the time the script is
 written - I ultimately want to put my script into a scheduled task, so
 the
 analysis can be repeated on updated data either daily or weekly).
 Using INSERT INTO myTable (a) VALUES (1) is simple enough, but what if
 I
 want to insert a sample number (using, e.g. WEEK(sample_date) as a
 sample
 identifier) along with the rate parameter estimated using fitdistr to
 fit
 an
 exponential distribution to a dataset, along with its sd?  If I were
 using
 Perl or Java, I'd set up the query similar to INSERT INTO myTable
 (a,b,c)
 VALUES (?,?,?), and then use function calls to set each of the query
  parameters.  I am having an awful time finding the corresponding
 functions
 in RMySQL.

 I've found the best way to parameterize is using R's sprintf function.
 For
 instance, the following query not only parameterizes the variable
 position,
 but also the table name:

  fields <- dbGetQuery(con, sprintf("select field,elem_label from %s_meta
  where field='%s'", inp$pnid, inp$field))

 Best,

 Jeff


 And for the data, the simplest, and most efficient, way to get the data
 is
 to use a statement like:

 SELECT a,b,c FROM myTable GROUP BY g_id, WEEK(sdate);

  The data is in MySQL, and my analysis needs to be applied independently
  to each group obtained from a query like this.  It appears I can't use a
  data frame since none of the samples are of the same size (let's say the
  probability of the samples being the same size is indistinguishable from
  0). Is it possible to put the resultset from such a query into a list of
 vectors
 that I can iterate over, passing each vector to fitdistr in turn?  If
 so,
 how?

 I know I can get this using Perl (by getting each sample individually
 and
 writing it to a file, then having R read the file, do the analysis and
 write
 the output to another file, and then have Perl parse the output file to
 insert the parameter estimates I need into the appropriate table), but
 that
 seems inefficient.

 Is it possible to do all I need with R working directly with MySQL?  If
 so,
 can someone fill in the apparent gaps left in the RMySQL documentation?

 Thanks.

 Ted


 --
 http://biostat.mc.vanderbilt.edu/JeffreyHorner

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/Getting-frustrated-with-RMySQL-tp19980592p19983099.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Applying an R script to data within MySQL? How to?

2008-10-08 Thread Ted Byers

I am trying something I haven't attempted before and the available
documentation doesn't quite answer my questions (at least in a way I can
understand).  My usual course of action would be to extract my data from my
DB, do whatever manipulation is necessary, either manually or using a C++
program, and then import the data into R.  Now I need to try to do it all
within R+RMySQL+MySQL.

I just managed to connect to MySQL and retrieve data using RMySQL as
follows:


> library(DBI)
> library(RMySQL)
> MySQL(max.con = 16, fetch.default.rec = 500, force.reload = F)
<MySQLDriver:(3800)>
> m <- dbDriver("MySQL")
> con <- dbConnect(m, user="rejbyers", password="jesakos",
+ host="localhost", dbname="merchants2")
> rs <- dbSendQuery(con, "select * from merchants")
> df <- fetch(rs, n = 150)
> df

And of course, that last statement is followed by the entire contents of
merchants

Now, I have a script like the following:
refdata18 = read.csv("K:\\MerchantData\\RiskModel\\ndays18.csv",
na.strings="")
x1 = refdata18[,1]
library(MASS)
ex1 = fitdistr(x1, "exponential")
str(ex1)


Now, the contents of ndaysXX.csv represent records where one of the date
values is in week XX of the current year. We don't yet have data spanning
multiple years, and will have to modify the SQL that gets the data
accordingly.  At present, my SQL statement groups records by WEEK of the
year, and then I manually separate weeks in a CSV file outside the DB.

Suppose I make a query like: SELECT ndays FROM xxx GROUP BY WEEK(tdate);

There is no a priori of knowing just how many weeks of data there are.

My reason for asking is I see information in the documentation about
dbApply(RMySQL), which says: "Applies R functions to groups of remote DBMS
rows without bringing an entire result set all at once. The result set is
expected to be sorted by the grouping field."  There is an example, but the
example doesn't make much sense (the query used, for example, does not
contain a GROUP BY clause).

I can easily set up a table that could be used to manage the output I need
(primarily the rate value estimated for each week, and the SD of the
estimate), but at present I am at a loss as to how to proceed to set this
up.  

Can some kind soul out there give me rather pedantic instructions on how to
use RMySQL to apply, in my case fitdistr, independently to each group of
values returned by my simplistic SQL query above, and insert the rate and sd
into another table?

I know I can handle all this using a perl script to create a suite of
temporary files, and process them one by one, but I have also been advised
to try to use R instead of Perl for this kind of task.

A slightly related question is this: Assuming I can get this all working
from within R, how would I make it a scheduled task on the one hand, or, on
the other hand, run it on demand from an event on a web page (which at
present is made using a combination of PHP, Apache's httpd server and MySQL,
if that matters)?  Of course, if I can make such an R script (or even store
it as a function) there should be no memory from one instance to another,
because the same analysis would have to be done on different users' data.

Thanks

Ted
-- 
View this message in context: 
http://www.nabble.com/Applying-an-R-script-to-data-within-MySQL---How-to--tp19888407p19888407.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] What distribution is related to hypergeometric?

2008-09-25 Thread Ted Byers

I have been reading, in various sources, that a poisson distribution is
related to binomial, extending the idea to include numbers of events in a
given period of time.

In my case, the hypergeometric distribution seems more appropriate, but I
need a temporal dimension to the distribution.

I have weekly samples of two kinds of events: call them A and B.  I have a
count of A events.  These change dramatically from one week to the next.  I
also have weekly counts of B events that I can relate to A events.  Some
fraction 'lambda' (between 1 and 1) of A events will result in B events some
time in the future (but also sometimes in the same week that the related A
event occurred).  The B event related to a given A event can occur as much as
ten weeks after the A event.  B events can not occur without a prior A
event, and well over half of the A events will never produce a B event. 
Also, we know that a given A event can not produce more than one B event. 
Hence hypergeometric is much more appropriate than binomial, and thus my
need for the distribution that has the same relation to the hypergeometric
that the poisson has to binomial.  Since hypergeometric is related to
binomial, would poisson also be related to hypergeometric?

My data is best expressed as a fraction: number of B events in a given week
divided by the number of A events producing the B events.  I.e. if there are
500 A events in week n, the data would be the number of related B events in
week m (m >= n) divided by 500, and the first table I get from the DB has
records containing an ordered pair: week number, fraction.  E.g.

0,0.2
1,0.3
2,0.25
3,0.2
...

The above is dummy data, but the pattern I see in the data is that the
number of B events in week 0 is less than the number of B events in week 1,
but from then on, the number of B events declines exponentially (as you'd
expect from what could be described as a decay process, altered to reflect
the fact that over half of the original A events will never produce B
events).  Of all the distributions I tried on this data, exponential and
poisson produced the best fits, with very little to choose between them.

Always, the cumulative fraction of A events that have produced B events
approaches an asymptote between 0.25 and 0.45.  Never higher, but now it
looks like the asymptotes are getting smaller (the behaviour of the system
is changing).

In a sense, this breaks down into two questions:  
1) What distribution should I try to fit to my data?  
2) How do I present my data to the functions that will try to fit the
distribution to this data?

The reason for the second is that, while I have examined lots of functions
(fBasics, MASS, &c.) that will try to fit a distribution to data, they all
seem to expect a 1D vector of data, and none of them say anything about the
data, or what to do if you already have an empirical (cumulative)
distribution.

To try out the functions that fit distributions, I created a dummy vector
where the initial sample size was 1000, and the number of values equal to a
given week number would be 1000 * the fraction of A events that produced B
events.  E.g. (using the sample numbers above, there'd be 200 '0's, 300
'1's, 250 '2's, &c.)
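A sketch of that construction with the dummy numbers above; rep() builds
the vector directly from the (week, fraction) pairs:

wk   <- c(0, 1, 2, 3)
frac <- c(0.20, 0.30, 0.25, 0.20)
x    <- rep(wk, times = round(1000 * frac))  # 200 '0's, 300 '1's, 250 '2's, 200 '3's
table(x)                                     # check the expansion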

Thanks

Ted
-- 
View this message in context: 
http://www.nabble.com/What-distribution-is-related-to-hypergeometric--tp19671054p19671054.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What distribution is related to hypergeometric?

2008-09-25 Thread Ted Byers

 I have weekly samples of two kinds of events: call them A and B.  I have a
count of A events.  These 
 change dramatically from one week to the next.  I also have weekly counts
 of B events that I can relate
 to A events.  Some fraction 'lambda' (between 1 and 1) of A events will
 result in B events some time in 

OOPS, that OUGHT to have been between 0 and 1.

Ted
-- 
View this message in context: 
http://www.nabble.com/What-distribution-is-related-to-hypergeometric--tp19671054p19671301.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Please help me interpret these results (fitting distributions to real data)

2008-09-25 Thread Ted Byers

I just thought of a useful metaphor for the problem I face.  I am dealing
with a problem in business finance, with two kinds of related events. 
However, imagine you have a known amount of carbon (so many kilograms), but
you do not know what fraction is C14 (and thus radioactive).  Only the C14
will give decay events (and once that event has occurred, the atom that
decayed will never decay again).  C12 will never decay.  What you want to
know is a) what is the ratio of C12 to C14 at time 0, and b) how many decay
events will happen between time x and time y, or how many decay events will
happen after time z.  That integral is, IIRC, quite simple.

The data you get from your equipment will be a number of decay events in
time period n (could be a specific week or a specific day).  How would you
get this data into R so that you can use, say, fitdistr(MASS) to estimate
the decay rate, and then proceed to answer the questions of interest?
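To make the metaphor concrete, a sketch with made-up counts: expand the
weekly counts into one entry per decay event, fit the rate with fitdistr,
and then the integrals are just differences of pexp:

library(MASS)
counts <- c(80, 120, 100, 50, 30, 20, 12, 8, 6, 4)  # B events in weeks 0..9 (made up)
events <- rep(0:9, times = counts)                  # one entry per decay event
rate   <- unname(fitdistr(events, "exponential")$estimate)
N      <- 1000                                      # "atoms" (A events) at time 0
lam    <- sum(counts) / N                           # fraction that ever decays (~0.43 here)
N * lam * (pexp(5, rate) - pexp(2, rate))           # expected decays between weeks 2 and 5
N * lam * (1 - pexp(10, rate))                      # expected decays after week 10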

Anyway, in my early tests (before I figured out which distribution is most
appropriate in this case), I got the following results (this is for one
week's data, but other weeks' result are similar).

==curious results=
> ex15 = fitdistr(x15, "exponential")
> str(ex15)
List of 4
 $ estimate: Named num 0.0653
  ..- attr(*, "names")= chr "rate"
 $ sd      : Named num 0.00356
  ..- attr(*, "names")= chr "rate"
 $ n       : int 337
 $ loglik  : num -1256
 - attr(*, "class")= chr "fitdistr"
> ge15 = fitdistr(x15, "geometric")
> str(ge15)
List of 4
 $ estimate: Named num 0.0613
  ..- attr(*, "names")= chr "prob"
 $ sd      : Named num 0.00324
  ..- attr(*, "names")= chr "prob"
 $ n       : int 337
 $ loglik  : num -1257
 - attr(*, "class")= chr "fitdistr"
> po15 = fitdistr(x15, "poisson")
> str(po15)
List of 4
 $ estimate: Named num 15.3
  ..- attr(*, "names")= chr "lambda"
 $ sd      : Named num 0.213
  ..- attr(*, "names")= chr "lambda"
 $ n       : int 337
 $ loglik  : num -2721
 - attr(*, "class")= chr "fitdistr"
> nb15 = fitdistr(x15, "negative binomial")
Warning messages:
1: In dnbinom(x, size, prob, log) : NaNs produced
2: In dnbinom(x, size, prob, log) : NaNs produced
3: In dnbinom(x, size, prob, log) : NaNs produced
> str(nb15)
List of 4
 $ estimate: Named num [1:2]  0.973 15.309
  ..- attr(*, "names")= chr [1:2] "size" "mu"
 $ sd      : Named num [1:2] 0.0786 0.8719
  ..- attr(*, "names")= chr [1:2] "size" "mu"
 $ loglik  : num -1267
 $ n       : int 337
 - attr(*, "class")= chr "fitdistr"
> AIC(ex15)
[1] 2514.952
> AIC(ge15)
[1] 2516.273
> AIC(po15)
[1] 5444.62
> AIC(nb15)
[1] 2538.385

=end curious results=

Notice that the AIC for the exponential and geometric distributions are
almost identical, and that for the negative binomial is not much different.

This now makes some sense: the geometric is a discrete analogue of the
exponential, as well as being a special case of the negative binomial. 
Right?  With such relationships among them, it would not be surprising to
see them give similar values of AIC.  Right?

Thanks

Ted
-- 
View this message in context: 
http://www.nabble.com/Please-help-me-interpret-these-results-%28fitting-distributions-to-real-data%29-tp19678782p19678782.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Statistical question re assessing fit of distribution functions.

2008-09-23 Thread Ted Byers

Thanks Timur

While assessing whether or not the best option would be a normal
distribution (it won't be; the data in this case LOOKS more poisson, or, if
I exclude the first week of results, negative exponential; and in my other
case, cauchy is more likely), I really need a test that can be applied
regardless of the distribution, to see which distribution fits best.  Using
log-likelihood, there doesn't seem to be much to choose between exponential
and poisson (the log-likelihood for them being almost the same, regardless
of the sample, even though the parameters are very different from one sample
to the next - I don't understand why yet), and the others I have tried are
MUCH worse, but I'm not done yet.

Are you aware of functions that allow estimation of all the parameters of a
non-central distribution?  I ask because a problem I'll be working on in a
few weeks will involve the kind of skew produced by a non-central
distribution (among others).  I see that some functions let you work with
skewed distributions (e.g. [dpqr]stable, the skewed stable distribution),
but I have not yet found functions that allow one to estimate their
parameters from real data.

Thanks,

Ted

Timur Shtatland wrote:
 
 If one of the goals is the normality test, then there may be better
 alternatives to the Kolmogorov-Smirnov test.
 See an explanation on:
 http://graphpad.com/FAQ/viewfaq.cfm?faq=959
 
 The R implementation:
 ?shapiro.test
 
 A casual search also turned this up:
 http://tolstoy.newcastle.edu.au/R/help/04/09/3201.html
 http://tolstoy.newcastle.edu.au/R/help/04/08/3121.html
 http://www.karlin.mff.cuni.cz/~pawlas/2008/MAI061/dagost.R
 
 Best,
 
 Timur
 --
 Timur Shtatland, Ph.D.
 Senior Bioinformatics Scientist
 Agencourt Bioscience Corporation - A Beckman Coulter Company
 500 Cummings Center, Suite 2450
 Beverly, MA 01915
 www.agencourt.com
 
 On Mon, Sep 22, 2008 at 12:26 PM, Ted Byers [EMAIL PROTECTED] wrote:

  I am in a situation where I have to fit a distribution, such as cauchy or
 normal, to an empirical dataset.  Well and good, that is easy.

 But I wanted to assess just how good the fit is, using ks.test.

 I am concerned about the following note in the docs (about the example
 provided):  Note that the distribution theory is not valid here as we
 have
 estimated the parameters of the normal distribution from the same sample

  This implies I should not use ks.test(x, "pnorm", mean = 1.187, sd = 0.917),
 where the numbers shown are estimated from 'x'.  If this is so, how do I
 get
 a correct test?  I know I can not use different samples because of just
 how
 different the parameters are from one sample to the next, so using
 parameters estimated from the sample from week one to define the
 distribution function for ks.test will give a poor fit for the data from
 week two.  And the sample size is small enough that I would not have
  confidence in the parameters estimated from a portion of a sample to fit
 against the remainder of the sample.

 Thanks

 Ted

 --
 View this message in context:
 http://www.nabble.com/Statistical-question-re-assessing-fit-of-distribution-functions.-tp19611539p19611539.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/Statistical-question-re-assessing-fit-of-distribution-functions.-tp19611539p19629108.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Trouble understanding the behaviour of stableFit(fBasics)

2008-09-23 Thread Ted Byers

Can anyone explain such different output:

> stableFit(s, alpha = 1.75, beta = 0, gamma = 1, delta = 0, 
+ type = c("q", "mle"), doplot = TRUE, trace = FALSE, title = NULL, 
+ description = NULL)

Title:
 Stable Parameter Estimation 

Call:
 .qStableFit(x = x, doplot = doplot, title = title, description =
description)

Model:
 Student-t Distribution

Estimated Parameter(s):
 alpha   beta  gamma  delta 
 1.534  0.275  0.3211991 -0.9922306 

Description:
 Tue Sep 23 22:18:44 2008 by user: Ted 

> refdata18 = read.csv("C:\\MerchantData\\RiskModel\\Capture.Week.18.csv",
+ na.strings="")
> stableFit(refdata18[,1], alpha = 1.75, beta = 0, gamma = 1, delta = 0, 
+ type = c("q", "mle"), doplot = TRUE, trace = FALSE, title = NULL, 
+ description = NULL)

Title:
 Stable Parameter Estimation 

Call:
 .qStableFit(x = x, doplot = doplot, title = title, description =
description)

Model:
 Student-t Distribution

Estimated Parameter(s):
alpha  beta gamma delta 
   NANANANA 

Description:
 Tue Sep 23 22:20:23 2008 by user: Ted 

 


I am just playing with it right now, trying to understand how to call it, so
first I passed the s vector from the example.  I don't care about the result
except to know that stableFit accepted the input and obtained an estimate
for the parameters.

Then I tried my data (a vector of integers, with a distribution that looks
similar to poisson, but exponential and geometric give better fits).

What I find puzzling is that I get no error messages complaining about one
property or another of my data, to explain why there are no parameter
estimates.  The data I WILL be applying this to comes from the financial
markets, and will be reals or floating point numbers that in some cases will
be best modelled by a normal distribution, while in most cases the
distribution will be closer to cauchy.  (DistributionFits(fBasics) makes no
explicit mention of cauchy, but IIRC cauchy is a special case of a stable
distribution, one of a family - are these the L-stable distributions
Mandelbrot discussed, or something else?  Correct me if my memory has failed
me sooner than anticipated ;-)  A URL for a website discussing these in
some detail would be handy, as my stats texts, dated as they are and focussed
more on applied biometrics, don't talk about these.

What do I look at if this function just gives me a bunch of 'NA's instead of
parameter estimates?

And, given the structure of the documentation, it is not clear whether I can
get an estimate of skewness for all the distributions, or for all except the
t and normal distributions, if I am using DistributionFits.

Thanks

Ted
-- 
View this message in context: 
http://www.nabble.com/Trouble-understanding-the-behaviour-of-stableFit%28fBasics%29-tp19640972p19640972.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Relative novice: Working with fitdistr(MASS): 3 questions

2008-09-22 Thread Ted Byers

OK, I am now at the point where I can use fitdistr to obtain a fit of one of
the standard distributions to mydata.

It is quite remarkable how different the parameters are for different
samples, though from the same system.  Clearly the system itself is not
stationary.

Anyway, question 1:  I require a visual perspective of the fit I get.  I can
use hist.scott to get a histogram (I just have to figure out how to get
finer granularity from it - my samples are taken weekly, but the histogram
bars cover two weeks of data, and the most interesting changes happen in the
first three to four weeks; after that things slow down tremendously), but
how would I overlay a plot of the best distribution I get from fitdistr over
it?
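For example (a sketch with simulated data in place of mine), I gather the
idiom is hist(..., prob = TRUE) followed by curve(..., add = TRUE):

library(MASS)
x   <- rexp(300, rate = 0.4)                      # stand-in for one weekly sample
fit <- fitdistr(x, "exponential")
hist(x, breaks = 20, prob = TRUE)                 # 'breaks' gives finer bins than hist.scott
curve(dexp(x, rate = fit$estimate), add = TRUE)   # fitted density over the histogram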

Second question: I don't see anything in the documentation for fitdistr that
says anything about using the distribution obtained to integrate the
distribution over some range of values.  I get weekly sampled, and for each
sample I get a certain number of events each week for about three months.  I
need to be able to use the distribution to estimate the number of such
events next week or the week after, and how long it will be that the
probability of such an event is so low that no more of them are likely to be
observed from that sample ever.  What package or functions should I be
looking at here to get this done?

Third question: I see nothing in the docs about non-central distributions. 
The distribution most likely to fit is cauchy, but we know that there is
skew that depends on the magnitude: large positive deviates are more common
than large negative deviates, but extremely large positive deviates are less
common than extremely large negative deviates.  What we don't know is how
significant such skewness is for the overall distribution.  How can I assess
this, or can I assess this, using fitdistr (or some other function I haven't
found yet)?

Thanks

Ted
-- 
View this message in context: 
http://www.nabble.com/Relative-novice%3A-Working-with-fitdistr%28MASS%29%3A-3-questions-tp19610812p19610812.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Statistical question re assessing fit of distribution functions.

2008-09-22 Thread Ted Byers

I am in a situation where I have to fit a distribution, such as cauchy or
normal, to an empirical dataset.  Well and good, that is easy.

But I wanted to assess just how good the fit is, using ks.test.

I am concerned about the following note in the docs (about the example
provided): "Note that the distribution theory is not valid here as we have
estimated the parameters of the normal distribution from the same sample."

This implies I should not use ks.test(x, "pnorm", mean = 1.187, sd = 0.917),
where the numbers shown are estimated from 'x'.  If this is so, how do I get
a correct test?  I know I can not use different samples because of just how
different the parameters are from one sample to the next, so using
parameters estimated from the sample from week one to define the
distribution function for ks.test will give a poor fit for the data from
week two.  And the sample size is small enough that I would not have
confidence in the parameters estimated from a portion of a sample to fit
against the remainder of the sample.
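One workaround I have seen described, sketched here with simulated data
standing in for mine (a parametric bootstrap of the KS statistic,
Lilliefors-style, so the p-value accounts for estimating the parameters
from the same sample):

x    <- rexp(100, rate = 2)                 # stand-in for the real sample
rhat <- 1 / mean(x)                         # exponential MLE from the same sample
D0   <- ks.test(x, "pexp", rate = rhat)$statistic
Dsim <- replicate(999, {
  y <- rexp(length(x), rate = rhat)         # simulate under the fitted model
  ks.test(y, "pexp", rate = 1 / mean(y))$statistic  # refit, then recompute D
})
mean(Dsim >= D0)                            # bootstrap p-value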

Thanks

Ted

-- 
View this message in context: 
http://www.nabble.com/Statistical-question-re-assessing-fit-of-distribution-functions.-tp19611539p19611539.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Why isn't R recognising integers as numbers?

2008-09-21 Thread Ted Byers

Thanks Jim,

Alas, it wasn't this.  Here is the output from both of your suggestions:

> refdata18 = read.csv("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv",
+ header = TRUE, na.strings="")
> str(refdata18)
'data.frame':   341 obs. of  1 variable:
 $ X0: int  0 0 0 0 0 0 0 0 0 0 ...
> scan("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv", what=0L)
Read 342 items
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [26]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [51]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [76]  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[101]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[126]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[151]  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
[176]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3
[201]  3  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4
[226]  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6
[251]  6  6  6  6  6  6  6  6  6  6  6  6  6  6  7  7  7  7  7  7  7  7  7  7  7
[276]  7  7  7  8  8  8  8  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10 10 10 10
[301] 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
[326] 12 12 12 18 18 18 18 18 18 18 18 18 18 18 18 18 18

Thanks anyway.

Ted
 

jholtman wrote:
 
 best guess is that they are not integers.  Do 'str' on your object and it
 probably says they are 'factors'.  This is probably due to some of your
 data
 being non-numeric.  Try using 'colClasses' on read.csv to specify what the
 column should contain.  Also try scan after skipping the first record if
 it is a header:
 
 scan("", what=0L)  # bad input after specifying integer
 1: 1 2 3 4
 5: 1 v
 5:
 Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
   scan() expected 'an integer', got 'v'
 scan("", what=0L)  # good input
 1: 1
 2: 2
 3: 3
 4:
 Read 3 items
 [1] 1 2 3

 
 On Sun, Sep 21, 2008 at 9:01 PM, Ted Byers [EMAIL PROTECTED] wrote:
 

 I have a number of files containing anywhere from a few dozen to a few
 thousand integers, one per record.

  The statement refdata18 =
  read.csv("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv", header =
  TRUE, na.strings="") works fine, and if I type refdata18, I get the
  integers displayed, one value per record (along with a record number).
  However, when I try fitdistr(refdata18, "negative binomial"), or
  hist.scott(refdata18, prob = TRUE), I get an error:

  Error in fitdistr(refdata18, "negative binomial") :
   'x' must be a non-empty numeric vector
  Or
  Error in hist.default(x, nclass.scott(x), prob = prob, xlab = xlab, ...) :
   'x' must be numeric

 How can it not recognise integers as numbers?

 Thanks

 Ted
 --
 View this message in context:
 http://www.nabble.com/Why-isn%27t-R-recognising-integers-as-numbers--tp19600308p19600308.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

 
 
 
 -- 
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390
 
 What is the problem that you are trying to solve?
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/Why-isn%27t-R-recognising-integers-as-numbers--tp19600308p19600695.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Why isn't R recognising integers as numbers?

2008-09-21 Thread Ted Byers

Thanks Marc,

That was it. 

For the last 30 years, I'd write my own code, in FORTRAN, C++, or even Java,
to do whatever statistical analysis I needed.  When at the office, sometimes
I could use SAS, but that hasn't been an option for me in years.

This is the first time I have had to load real data into R (instead of
generating random data to use while playing with some of the stats
functions, or manually typing dummy data).

I take it, then, that the result of loading data is a data frame, and not
just a matrix or array.  Using something like refdata18[, 1] feels rather
alien, but I'm sure I'll quickly get used to it.  I'd seen it before in the
R docs, but it didn't register that I had to use it to get the functions of
most interest to me to recognise my data as a vector of numbers, given I'd
provided only a vector of integers as input.
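For example (a stand-in one-column frame):

refdata18 <- data.frame(X0 = c(0L, 0L, 1L, 2L))
refdata18[, 1]      # positional indexing
refdata18$X0        # by column name (str() shows the names)
refdata18[["X0"]]   # list-style extraction, also by name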

Thanks

Ted


Marc Schwartz wrote:
 
 on 09/21/2008 08:01 PM Ted Byers wrote:
 I have a number of files containing anywhere from a few dozen to a few
 thousand integers, one per record.
 
  The statement refdata18 =
  read.csv("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv", header =
  TRUE, na.strings="") works fine, and if I type refdata18, I get the
  integers displayed, one value per record (along with a record number).
  However, when I try fitdistr(refdata18, "negative binomial"), or
  hist.scott(refdata18, prob = TRUE), I get an error:
  
  Error in fitdistr(refdata18, "negative binomial") : 
    'x' must be a non-empty numeric vector
  Or
  Error in hist.default(x, nclass.scott(x), prob = prob, xlab = xlab, ...) : 
    'x' must be numeric
 
 How can it not recognise integers as numbers?
 
 Thanks
 
 Ted
 
 'refdata18' is a data frame and the two functions are expecting a
 numeric vector.
 
 If you use:
 
    fitdistr(refdata18[, 1], "negative binomial")
 
 or
 
   hist(refdata18[, 1])
 
 you should get a suitable result, presuming that the first column in the
 data frame is a numeric vector.
 
 Use:
 
   str(refdata18)
 
 to get a sense for the structure of the data frame, including the column
 names, which you could then use, instead of the above index based syntax.
 
 HTH,
 
 Marc Schwartz
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/Why-isn%27t-R-recognising-integers-as-numbers--tp19600308p19600803.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Novice question about getting data into R

2008-09-19 Thread Ted Byers

I found it easy to use R when typing data manually into it.  Now I need to
read data from a file, and I get the following errors:

> refdata =
+ read.table("K:\\MerchantData\\RiskModel\\refund_distribution.csv", header
+ = TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 42 elements
> refdata =
+ read.table("K:\\MerchantData\\RiskModel\\refund_distribution.csv")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 42 elements


(I'd tried the first version above because the first record has column
names.)

First, I don't know why R expects 42 elements in a record.  
There is one column for a time variable (weeks since a given week of samples
were taken) and one for each week of sampling in the data file (Week 18
through Week 37 inclusive).  And there are only 19 rows.
The samples represented by the columns are independent, and the numbers in
the columns are the fraction of events sampled that result in an event of
another kind in the week since the sample was taken.

The samples are not the same size, and starting with week 20, the number of
values progressively gets smaller since there have been fewer than 37  weeks
since the samples were taken.

I can show you the contents of the data file if you wish.  It is
unremarkable CSV, with the strings used for column names enclosed in double
quotes.

I don't have to manually separate the samples into their own files, do I?  I
was hoping to write a function that estimates the density function that best
fits each sample individually, and then iterate over the columns, applying
that function to each in turn.

What is the best way to handle this?
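To sketch what I have in mind (path and column handling assumed): read.csv
should cope with the commas and the quoted headers, the short columns come
in padded with NA, and each fit can simply drop the NAs first:

library(MASS)
refdata <- read.csv("refund_distribution.csv", na.strings = "")
fits <- lapply(refdata[-1],                      # drop the time column
               function(col) fitdistr(na.omit(col), "exponential"))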

Thanks

Ted


-- 
View this message in context: 
http://www.nabble.com/Novice-question-about-getting-data-into-R-tp19576065p19576065.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Novice question about getting data into R

2008-09-19 Thread Ted Byers

Thanks one and all.

Actually, I used OpenOffice's spreadsheet to create the csv file, but I have
been using it long enough to know to specify how I wanted it, and sometimes,
when that proves annoying, I'll use Perl to finesse it the way I want it.

It seems my principal error was to assume that it would ignore the character
strings within the double quotes and determine fields based on the commas. 
Silvia's remarks about empty cells and blanks in the middle of column names
were right on the mark.

Tom, I appreciate the caveats you mention.  I am aware of the complications
of i18n, but they don't affect me much as my stuff is run exclusively in
Canada (pretty much the same norms as the US).  They don't affect me (in a
sense because I have manipulated data around such issues using perl in order
to satisfy the peculiarities of the software used on one project or another
- I deal with it almost as a matter of course, as long as I already know the
peculiarities of the software I am working with), and I have plenty of
experience moving data between spreadsheets, RDBMS such as MS SQL,
PostgreSQl, MySQL, and XML files, and have had to resort to unusual
delimiters in the past because of peculiarities in the data feed.  While I
have tonnes of experience developing software (C++, Java, FORTRAN, perl) I
only started playing with R a few months ago, and this is the first I have
had to import real data into it.  While the tutorials I found were useful,
it seems there are key tidbits of information I need scattered through the
documentation and I am finding it challenging to find the peculiarities of
R.

Thanks again one and all.

Ted



Tom Backer Johnsen wrote:
 
 Silvia Lomascolo wrote:
 
 refdata =
 read.table(K:\\MerchantData\\RiskModel\\refund_distribution.csv,
 header
 = TRUE)
 Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
 na.strings, 
 : 
   line 1 did not have 42 elements
 refdata =
 read.table(K:\\MerchantData\\RiskModel\\refund_distribution.csv)
 Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
 na.strings, 
 : 
   line 2 did not have 42 elements
  R interprets that you have 42 columns from the variable names. Do you?
  See if removing spaces between column names helps (e.g., week.1 instead
  of week 1).  Also, because yours is a csv file, fields are separated by
  commas.  You can either use the read.csv command instead of read.table
  (see ?read.table for details), or add the argument sep="," to tell R
  that fields are separated by commas.  You might also need to specify,
  if you have empty cells, what to do with them (e.g., na.strings="")
 
 You are of course right about the NA's (missing values, empty cells) as 
 well as the possible blanks in the column names.  It might nevertheless 
 be a good idea for him to at least submit a few of the lines at the top 
 of the file.  A .csv file as generated by Excel on Windows is not 
 necessarily comma-separated.  That depends on the list separator 
 setting under Regional Language Settings found in the Control Panel. 
 On my machine, the list separator is a semicolon for a .csv file.  The 
 reason is simple, in Norway, the standard decimal separator is a comma, 
 and you do not want to confuse the system too much.  So, that particular 
 point is dependent on the settisngs for his locale (language, country).
 
 Tom
 
 
 
 
 
 
 -- 
 ++
 | Tom Backer Johnsen, Psychometrics Unit,  Faculty of Psychology |
 | University of Bergen, Christies gt. 12, N-5015 Bergen,  NORWAY |
 | Tel : +47-5558-9185Fax : +47-5558-9879 |
 | Email : [EMAIL PROTECTED]URL : http://www.galton.uib.no/ |
 ++
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/Novice-question-about-getting-data-into-R-tp19576065p19577763.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Use of distribution model to estimate probability of an event

2008-09-18 Thread Ted Byers

I have a situation where there are two kinds of events: A and B.  B does not
occur without A occurring first, and a percentage of A events lead to an
event B some time later, and the remaining ones do not.  I have n independent
samples, with a frequency of B events by week, until the B events for a given
week's A events no longer happen (after about 10 weeks, the chance of
another B event is less than 0.1%).  That gives me good enough data to
determine which distribution fits the data.  But looking at the data for
several weeks of A events, it is clear that although the distributions have a
similar shape (e.g. the corresponding B events peak in week two), there are
significant differences between weeks of A events regarding the fraction of
them that lead to B events (sometimes it is 25% and sometimes it is 45%,
with dozens of values in between being observed).

I know how to use R to fit the distributions.

The question is: once I have fit a distribution to the data (i.e. I know the
distribution and its parameters that give the best fit), is there a function
in R that I can use to obtain the number of events of type B that will occur
in week M, given a number of A events in a prior week N?  (Knowing the number
of A events and a density function, all I need is the probability of a B
event in the week of interest - a simple forecast, since the week for which
we want the answer hasn't come yet.)  If so, just tell me the name of the
function and the package, and I'll find it and read up on it.  This is for
the development of a model of risk (A events being desirable and B events
representing a cost to all concerned).

It is a simple enough model, but I am having a little trouble finding the
last piece of the puzzle that I need.
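To make the question concrete (a sketch with made-up numbers): I gather the
missing piece is just the cumulative distribution function for whichever
distribution fits - pexp, ppois, pcauchy, &c. in the stats package - since
the expected count in week M is N * lambda * (F(M) - F(M-1)):

N      <- 500     # A events in the prior week
lambda <- 0.35    # fraction of A events that ever produce a B event (illustrative)
rate   <- 0.45    # fitted exponential rate (illustrative)
M      <- 4       # weeks after the A events
N * lambda * (pexp(M, rate) - pexp(M - 1, rate))  # expected B events in week M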

Thanks

Ted
-- 
View this message in context: 
http://www.nabble.com/Use-of-distribution-model-to-estimate-probability-of-an-event-tp19565047p19565047.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] library/function that estimates parameters of well known distributions from empirical data?

2008-09-05 Thread Ted Byers

Thanks Ben

That was the one I'd remembered but couldn't find.

Mark Leeds also told me about DistributionFits(fBasics), which I hadn't
seen.  There seems to be only a little overlap between the two.

Could I trouble you to expand on AIC (esp. what the function name and
package is to apply it to the output from these two functions)?  I just read
the help provided for each and neither mentions AIC.
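For example (a sketch with simulated data; AIC here is stats::AIC working
through the logLik method that MASS provides for fitdistr objects):

library(MASS)
x  <- rcauchy(500)            # stand-in data
fn <- fitdistr(x, "normal")
fc <- fitdistr(x, "cauchy")
AIC(fn, fc)                   # one row per fit; lower AIC wins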

Thanks again Ben

Ted


Ben Bolker wrote:
 
 Ted Byers r.ted.byers at gmail.com writes:
 
 
 
 I found this a few months ago, but for the life of me I can't remember
 what
 the function or package was, and I have had no luck finding it this week.
 
 I have found, again, the functions for working with distributions like
  Cauchy, F, normal, &c., and ks.test, but I have not found the functions
 for
 estimating the distribution parameters given a vector of values.
 
 
   Look at the fitdistr function in the MASS package.  Consider
 AIC comparisons for ranking the fits to these non-nested models.
 
   good luck
Ben Bolker
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/library-function-that-estimates-parameters-of-well-known-distributions-from-empirical-data--tp19323700p19339442.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] library/function that estimates parameters of well known distributions from empirical data?

2008-09-04 Thread Ted Byers

I found this a few months ago, but for the life of me I can't remember what
the function or package was, and I have had no luck finding it this week.

I have found, again, the functions for working with distributions like
Cauchy, F, normal, &c., and ks.test, but I have not found the functions for
estimating the distribution parameters given a vector of values.

What I need to do is estimate the distribution parameters for each candidate
distribution, and then test to see which gives the best fit to the data.

I want to examine the question, given this dataset (which may have thousands
of records), does the normal or cauchy distribution fit the data best, and
with what parameters.  It will not be known a priori whether or not the
most appropriate distribution is non-central, though we do know that often
(not always) values of medium size in absolute value are more often positive
than negative and that very large values are more often negative than
positive.

Could someone please give me a gentle reminder of the package and
function(s) I ought to be examining?

Thanks

Ted
-- 
View this message in context: 
http://www.nabble.com/library-function-that-estimates-parameters-of-well-known-distributions-from-empirical-data--tp19323700p19323700.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] license for a university

2008-09-03 Thread Ted Byers

Erin,

I trust you know what you risk when you assume.  ;-)

There IS a license, but it basically lets you copy or distribute it, or, in
your case, install it on as many machines as you wish.  It is the GNU GENERAL
PUBLIC LICENSE.

Like most open source software I use, the GNU license is in place primarily
to ensure everyone can freely use it.

Cheers

Ted

Erin Hodgess-2 wrote:
 
 Dear R People:
 
 I am trying to install R in a classroom here, but have been told that
 there must be a license.
 
 Is there such a thing with R, please?  Since it is free, I assumed
 that there would be no license.
 
 Thanks for any help,
 Sincerely,
 Erin
 
 
 -- 
 Erin Hodgess
 Associate Professor
 Department of Computer and Mathematical Sciences
 University of Houston - Downtown
 mailto: [EMAIL PROTECTED]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 
-- 
View this message in context: 
http://www.nabble.com/%22license%22-for-a-university-tp1928p19300187.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.