Re: [R] Olympics: 200m Men Final

2012-08-10 Thread R. Michael Weylandt
Continuing on with fun, if silly, analyses: a little voice in my head
suggests a time series model and, rather than putting any thought
into it, I'll use some R goodness.

Setting up the data as Rui provided, we need to add some NA's to
account for WWII:

library(zoo)
golddata.ts <- as.ts(zoo(golddata[,6], order.by = golddata[,1]))
# Here we let Gabor and Achim think about how to get the NAs in there smoothly
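
For anyone without zoo at hand, the NA padding can also be done by hand in base R. This is only a minimal sketch: the `toy` data frame and its values are invented for illustration (they are neither the real results nor the column layout of golddata), and the idea is simply to merge against the full four-year grid so the missing Games show up as NA.

```r
# Toy values, invented for illustration only (not real Olympic results).
toy <- data.frame(Year   = c(1932, 1936, 1948, 1952),
                  Result = c(21.0, 20.8, 20.9, 20.6))

# Full grid of Games years, including the ones that were never held.
all_years <- data.frame(Year = seq(min(toy$Year), max(toy$Year), by = 4))

# Left join: years missing from 'toy' (1940, 1944) get Result = NA.
padded <- merge(all_years, toy, all.x = TRUE)

# One observation every 4 years, so deltat = 4.
toy.ts <- ts(padded$Result, start = min(padded$Year), deltat = 4)
print(padded)
```

The resulting series has the gaps in place, which is what the time series functions need in order to see the observations as equally spaced.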

library(forecast)
golddata.model <- auto.arima(golddata.ts)

# Prof Hyndman has forgotten more about time series than I will ever know

summary(golddata.model) # ARIMA(2,1,0)+drift seems a bit heavy handed,
                        # but that's what it gives
forecast(golddata.model, 1)

     Point Forecast    Lo 80    Hi 80    Lo 95  Hi 95
2012       19.66507 19.23618 20.09396 19.00915 20.321

# But looking at a graph, this seems to have an odd jump up
plot(forecast(golddata.model))

# Maybe we overfit -- let's kill the drift

golddata.model2 <- auto.arima(golddata.ts, allowdrift = FALSE)

summary(golddata.model2) # ARIMA(1,1,0) seems better
plot(forecast(golddata.model2)) # I like the graph more too
forecast(golddata.model2, 1)

     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2012       19.56139 19.04134 20.08145 18.76604 20.35674

Not so very good at all, but a little bit of R fun nevertheless ;-)
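
For anyone wondering where the Lo/Hi columns come from: as far as I understand it, they are the point forecast plus or minus a normal quantile times the forecast standard error. Below is a minimal sketch using base R's arima() instead of auto.arima(), on a simulated series (the numbers are made up, not the Olympic data). The ARIMA(1,1,0) order is simply borrowed from the second model above, and `interval` is a helper name of my own.

```r
set.seed(1)

# Simulated stand-in for the winning times: slow downward trend plus noise.
years <- seq(1900, 2008, by = 4)
toy <- ts(22 - 0.02 * (years - 1900) + rnorm(length(years), sd = 0.2),
          start = 1900, deltat = 4)

# Fixed ARIMA(1,1,0), no drift; order chosen by hand, not by auto.arima().
fit <- arima(toy, order = c(1, 1, 0), method = "ML")

# One-step-ahead prediction and its standard error.
p     <- predict(fit, n.ahead = 1)
point <- as.numeric(p$pred)
se    <- as.numeric(p$se)

# Symmetric normal-theory interval at a given coverage level.
interval <- function(level) point + c(-1, 1) * qnorm(0.5 + level / 2) * se

lo_hi_80 <- interval(0.80)
lo_hi_95 <- interval(0.95)
print(round(c(point = point, lo80 = lo_hi_80[1], hi80 = lo_hi_80[2],
              lo95 = lo_hi_95[1], hi95 = lo_hi_95[2]), 3))
```

Seen this way, the wide intervals in the output above mostly say that the one-step standard error is large relative to the hundredths we care about.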

And in the category of "how good is your prediction when you already
know the answer and don't care at all about statistical rigor", it
seems that regressing on year might still be winning. Anyone want to
take some splines out for a spin?
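
A starting point for the splines, as a rough sketch only: base R's smooth.spline() with its default smoothing choice, fitted to a simulated series (invented numbers, not the real winning times). The prediction for 2012 is an extrapolation beyond the data, so it should be taken with a generous pinch of salt.

```r
set.seed(2)

# Simulated stand-in data: downward trend plus noise (not real results).
years <- seq(1900, 2008, by = 4)
times <- 22 - 0.02 * (years - 1900) + rnorm(length(years), sd = 0.2)

# Smoothing spline; the amount of smoothing is picked by the default criterion.
sp <- smooth.spline(years, times)

# Extrapolate one Games ahead.
pred2012 <- predict(sp, x = 2012)$y
print(pred2012)
```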

Cheers,
Michael


Re: [R] Olympics: 200m Men Final

2012-08-10 Thread Rui Barradas

Hello,

The main critique, I think, is that we assume a certain type of model
where the times can decrease all the way to zero, and that they can do so
linearly. I believe that records can always be beaten, but 40-50 years
ago times were measured in tenths of a second; now we see a gain in the
hundredths as extraordinary. So the assumption doesn't seem to be
completely reasonable.
As for your assumption that little variation in the responses results in
little variation in the predictions, I would add that this is true, but
only for a given model. The predictions can and do vary from model to model
(obviously). See the logistic model in the same Gesmann work or Michael's
ARIMA in a response to my post. Three different predicted values, with
variations from model to model in the tenths of a second. The values
are, resp., 19.61 (Gesmann) and 19.67 and 19.56 (Weylandt).
Maybe the linear model performs well because, like you say, the
sprinters post times very close to each other and a straight line is
not far from what a more complex model would do. I'm not betting on the
marathon times.
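
The point about times decreasing to zero can be made concrete with a small sketch. The trend below is invented and noise-free (it is not fitted to the real results); it only shows that a straight line in the times must cross zero at -intercept/slope, while a line fitted to log(times) stays positive however far it is extrapolated.

```r
# Invented, noise-free toy trend: 22 s in 1900, losing 0.02 s per year.
years <- seq(1900, 2008, by = 4)
times <- 22 - 0.02 * (years - 1900)

# Straight line in the times: reaches zero at -intercept / slope.
lin <- lm(times ~ years)
zero_year <- unname(-coef(lin)[1] / coef(lin)[2])   # about 3000 for this toy trend

# Straight line in log(times): back-transformed predictions are always > 0.
loglin <- lm(log(times) ~ years)
far <- exp(predict(loglin, newdata = data.frame(years = c(3000, 4000))))

print(zero_year)
print(far)
```

Neither extrapolation is believable that far out, of course; the sketch only shows what each model form implies.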


Rui Barradas


Re: [R] Olympics: 200m Men Final

2012-08-10 Thread Daróczi Gergely

Re: [R] Olympics: 200m Men Final

2012-08-09 Thread Mark Leeds
Hi Rui: I hate to sound like a pessimist/cynic, and I should also state that
I didn't look at any of the analysis by you or the other person. But my
question (for anyone who wants to chime in) is: given that all these Olympic
100-200 meter runners post times that are generally within 0.1-0.3 seconds of
each other, or even less, doesn't it stand to reason that a model, given the
historical times, is going to predict well? I don't know what the statistical
term for this is, but intuitively, if there's extremely little variation in
the responses, then there's going to be extremely little variation in the
predictions, and the result is that you won't ever be too far off as long as
your predictors are not too strange (weight, past performances, height,
whatever).

Anyone can feel free to chime in and tell me I'm wrong, but if you're going to
do that, I'd appreciate statistical reasoning, even though I don't have any.
Thanks.

mark

On Thu, Aug 9, 2012 at 4:23 PM, Rui Barradas ruipbarra...@sapo.pt wrote:

 Hello,

 Have you seen the log-linear prediction of the 100m winning time in R
 mailed to the list yesterday by David Smith, subject "Revolutions Blog:
 July roundup"?

 "A log-linear regression in R predicted the gold-winning Olympic 100m
 sprint time to be 9.68 seconds (it was actually 9.63 seconds):
 http://bit.ly/QfChUh"

 The original by Markus Gesmann can be found at
 http://lamages.blogspot.pt/2012/07/london-olympics-and-prediction-for-100m.html

 I've done the same, just changing the address to the 200m historical data,
 and the predicted time was 19.27. Usain Bolt has just run 19.32. If you
 want to check it, the address and the 'which' argument are:

 url <- "http://www.databasesports.com/olympics/sport/sportevent.htm?sp=ATH&enum=120"
 

 Plus a change in the graphics functions' y-axis arguments so that times
 around double the 100m ones can be plotted and seen.

 #
 # Original by Markus Gesmann:
 # http://lamages.blogspot.pt/2012/07/london-olympics-and-prediction-for-100m.html
 library(XML)
 library(drc)
 url <- "http://www.databasesports.com/olympics/sport/sportevent.htm?sp=ATH&enum=120"

 data <- readHTMLTable(readLines(url), which=3, header=TRUE)
 golddata <- subset(data, Medal %in% "GOLD")
 golddata$Year <- as.numeric(as.character(golddata$Year))
 golddata$Result <- as.numeric(as.character(golddata$Result))
 tail(golddata,10)
 logistic <- drm(Result~Year, data=subset(golddata, Year>=1900), fct = L.4())
 log.linear <- lm(log(Result)~Year, data=subset(golddata, Year>=1900))
 years <- seq(1896,2012, 4)
 predictions <- exp(predict(log.linear, newdata=data.frame(Year=years)))
 plot(logistic,  xlim=c(1896,2012),
      ylim=range(golddata$Result) + c(-0.5, 0.5),
      xlab="Year", main="Olympic 100 metre",
      ylab="Winning time for the 100m men's final (s)")
 points(golddata$Year, golddata$Result)
 lines(years, predictions, col="red")
 points(2012, predictions[length(years)], pch=19, col="red")
 text(2012 - 0.5, predictions[length(years)] - 0.5,
      round(predictions[length(years)],2))

 Rui Barradas


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.