While I can't help much in the way of assessing the correlation (at least in a numerical sense), I have provided some code below to visualize the data, bringing in an additional variable for the preseason ranking of each team according to the AP poll as it appears here:
http://sports.espn.go.com/ncf/rankingsindex?seasonYear=2007&weekNumber=1&seasonType=2
Note that I did not double-check my transcriptions of the preseason rankings, so I do not guarantee their accuracy.

Here's the R code to create the graphic.

load('BCS.rda')

# Add pre-season ranking via the AP Top 25
BCS$preseason <- c(11,28,2,36,29,8,3,9,NA,35,6,1,26,46,40,29,23,13,4,5,12,18,NA,
                   32,7,34,16,14,24,7,44,43,41,25,NA,NA,37,NA,17)

# Stack the four rankings into long format. The order of the rank columns
# must match the order of the poll labels (HR = Harris, UR = USA Today).
# The BCS rank is taken to be the row position in the data frame.
rankp <- c(BCS$HR, BCS$UR, 1:dim(BCS)[1], BCS$preseason)
comp <- rep(BCS$Cavg, 4)
poll <- rep(c('Harris','USA Today','BCS','Pre-Season'), each=dim(BCS)[1])
dat <- data.frame(Rank=rankp, Cavg=comp, poll=poll,
                  school=rep(rownames(BCS),4))
dat$schoolordered <- factor(dat$school, levels=rownames(BCS),
                            labels=rownames(BCS), ordered=TRUE)

# Ten largest Cavg scores, drawn as reference lines in every panel
bigcavg <- tail(sort(BCS$Cavg), 10)

library(lattice)

# White background, one colour and symbol per poll, grey strips
new.back <- trellis.par.get("background")
new.back$col <- "white"
newcol <- trellis.par.get("superpose.symbol")
newcol$col <- c('red','blue','black','green4','black')
newcol$pch <- c(4,1,16,6)
new.pan <- trellis.par.get("strip.background")
new.pan$col <- c('grey90','white')
trellis.par.set("background", new.back)
trellis.par.set("superpose.symbol", newcol)
trellis.par.set("strip.background", new.pan)

# One panel per school; groups are plotted in alphabetical order of poll
# name (BCS, Harris, Pre-Season, USA Today), which is also the key order
xyplot(Cavg ~ Rank | schoolordered, groups=poll,
       data=subset(dat, Rank <= 20),
       type=c('p'), xlim=c(-1.5,21.5),
       panel=function(x,y,...){
         panel.abline(h=bigcavg, col='grey80')
         panel.abline(v=seq(0,20,5), col='grey60')
         panel.superpose(x,y,...)
       },
       xlab='Poll Rank', ylab='Computer Average',
       key=list(
         points=list(col=trellis.par.get('superpose.symbol')$col[1:4],
                     pch=trellis.par.get('superpose.symbol')$pch[1:4]),
         text=list(lab=sort(unique(poll)),
                   col=trellis.par.get('superpose.symbol')$col[1:4]),
         columns=4, title='Poll System', cex=1)
)

Notes on the output
1) Panels are arranged in order of BCS standing, which is also marked by the red x.
2) The top 10 Cavg scores are drawn as horizontal lines in each panel.
3) For clarity I took a subset of the original data set, only looking at rankings <= 20.

Some comments
1) Arizona State: While the teams ranked 1-3 are consistent, AZ St.'s polls differ quite a bit, w/ the BCS ranking being pulled up by the computers.
2) USC and OK St.
and GA: All three have lower Cavg scores but high rankings in the human polls (Harris and USA Today), which tend to coincide w/ the preseason ranking.
3) As for bias, #2 seems to show some bias in the human polls, though I would not say the computer ranking is w/o flaw.

Basically, I can't find the magic bullet, and I imagine the debate will continue on what is the best way to determine the two best teams to play for the championship - a far from perfect scenario. Yet another debate on the use of Super Crunching, if I may borrow from Ian Ayres.

Open to any ideas/opinions.

Cheers,
-Mat

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Horace Tso
Sent: Thursday, October 25, 2007 3:39 PM
To: R Help; Douglas Bates
Subject: Re: [R] Appropriate measure of correlation with 'zero-inflated' data?

Doug and the football fans out there,

I'm no football expert myself. But here is what my colleague said after reading the posting.

"I can't help you with the equation, but I can say that the polls are very poor predictors of performance. The reason they do such a bad job is that pollsters rank the teams even before the season starts based on perceived talent. That ranking system makes it hard for a team to move up the polls as long as the teams in front of them keep winning. Also, polling introduces many personal biases.

"College football could easily solve the problem with a play-off system, but the powerful football conferences wouldn't make as much money, so they won't agree to it."

Cheers.

Horace

>>> "Douglas Bates" <[EMAIL PROTECTED]> 10/25/2007 10:58:24 AM >>>
I have reached the correlation section in a course that I teach and I hit upon the idea of using data from the weekly Bowl Championship Series (BCS) rankings to illustrate different techniques for assessing correlation.

For those not familiar with college football in the United States (where "football" refers to American football, not what is called soccer here and football in most other countries) I should explain that many, many universities and colleges have football teams but each team only plays 10-15 games per season, so not every team will play every other team. The game is so rough that it is not feasible to play more than one match per week and a national playoff after the regular season is impractical. It would take too long and the players are, in theory, students first and athletes second.

In place of a national playoff there are various polls of coaches or sports writers that purport to rank teams nationally. Several analysts also publish computer-based rankings that use complicated formulas based on scores in individual games, strength of the opponent, etc. to rank teams. Rankings from two of the "human polls" (the Harris poll of sports writers and the USA Today poll of the coaches) and from six of the computer polls are combined to produce the official BCS ranking. The Wikipedia entry for "Bowl Championship Series" gives the history and evolution of the actual formula that is currently used.

This season has been notable for the volatility of those rankings. One is reminded of the biblical prophecy that "The first shall be last and the last shall be first".

Another notable feature this year is the extent to which the computer-based rankings and the rankings in the human polls disagree. I enclose a listing of the top 25 teams and the components of the rankings as of last Sunday (2007-10-21). (Almost all college football games are played on Saturdays and the rankings are published on Sundays.)

The columns are
  Rec  - won-loss record
  Hvot - total number of Harris poll votes
  Hp   - proportion of maximum Harris poll votes
  HR   - rank in the Harris poll (smaller is better)
  Uvot, Up, UR - same for the USA Today poll
  Cavg - average score (it's actually a trimmed mean) on computer-based rankings (larger is better)
  BCS  - BCS score - the average of Hp, Up and Cavg
  Pre  - BCS rank in the previous week

As I understand it, the votes in the Harris and USA Today polls are calculated by asking each voter to list their top 25 teams, then awarding 25 points for a team ranked 1, 24 points for a team ranked 2, etc. on each ballot and calculating the total. Apparently there are now 114 Harris poll participants and 60 USA Today poll participants, giving maximum possible scores of 2850 and 1500, respectively.

The Cavg column is calculated from 6 scores of 0 to 25 (larger is better), dropping the largest and smallest scores. The raw score is out of 100 and the proportion is reported as Cavg.

The data frame is available (for a little while) as
http://www.stat.wisc.edu/~bates/BCS.rda

The raw scores and the rankings from the Harris and USA Today polls are in fairly good agreement but the Cavg scores are very different. Although scatterplots will show this, I feel that correlation measures may be thrown off by the large number of zeros in the Cavg scores. What would be the preferred way of measuring correlation in such a case? What would be a good graphical presentation showing the lack of agreement of the various components of the BCS score?

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
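
The scoring arithmetic in the original post can be sketched in R as follows. This is only an illustration of the description given there, not the official BCS computation: the function names (poll_max, poll_prop, cavg) and the example scores are made up for the sketch. The only facts taken from the post are the 25 points for a first-place ballot, the 114 and 60 voter counts, and the rule of dropping the largest and smallest of six computer scores.

```r
# Maximum possible poll points: every voter ranks the team first (25 points).
poll_max <- function(n_voters, top = 25) n_voters * top

poll_max(114)  # Harris poll: 2850
poll_max(60)   # USA Today poll: 1500

# Proportion of the maximum, as in the Hp and Up columns.
poll_prop <- function(votes, n_voters) votes / poll_max(n_voters)

# Cavg as described: six scores on a 0-25 scale, drop the largest and the
# smallest, sum the remaining four (out of 100), and report the proportion.
cavg <- function(scores) {
  stopifnot(length(scores) == 6, all(scores >= 0 & scores <= 25))
  kept <- sort(scores)[2:5]
  sum(kept) / 100
}

# Hypothetical computer scores for one team, for illustration only.
cavg(c(25, 24, 23, 25, 20, 22))  # keeps 22, 23, 24, 25 -> 0.94
```

A team left off most of the computer lists would score 0 on those systems, which is where the many zeros in Cavg come from.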