Dear Petr and John,

Many thanks for your prompt replies.
Here is what I am trying to do. I am building a statistical model of a given time series using the Box-Jenkins methodology. The series has 93 data points. Before I analyse the ACF and PACF, I am required to de-trend the series, which appears to have an upward trend. I want to find out what order of polynomial I should fit to the series without overfitting. For this I want to use orthogonal polynomials (I recall reading on the internet that orthogonal polynomials can be used to prevent overfitting). This seems to me like a poor man's cross-validation. My plan is to keep increasing the degree of the orthogonal polynomial until the coefficient of the last orthogonal term becomes insignificant (a short sketch of this procedure is given below).

Note: if I do NOT use orthogonal polynomials, I will overfit the data set, and I don't think that is a good way to detect the true order of the polynomial.

Now that I have detrended the series and built an ARIMA model of the residuals, I want to forecast. For this I need to use the original (raw) polynomials and their coefficients.

I hope I was clear and that my methodology is OK. I have another query: if I use cross-validation to determine the order of the polynomial, I don't get a clear answer. See here:

library(boot)
mydf <- data.frame(cbind(gdp, x))

d <- c(cv.glm(data = mydf, glm(gdp ~ x), K = 10)$delta[1],
       cv.glm(data = mydf, glm(gdp ~ poly(x, 2)), K = 10)$delta[1],
       cv.glm(data = mydf, glm(gdp ~ poly(x, 3)), K = 10)$delta[1],
       cv.glm(data = mydf, glm(gdp ~ poly(x, 4)), K = 10)$delta[1],
       cv.glm(data = mydf, glm(gdp ~ poly(x, 5)), K = 10)$delta[1],
       cv.glm(data = mydf, glm(gdp ~ poly(x, 6)), K = 10)$delta[1])
print(d)
## [1] 2.178574e+13 7.303031e+11 5.994783e+11 4.943586e+11 4.596648e+11
## [6] 4.980159e+11
# Here it chooses 5 (although 4 and 5 are quite similar).

d1 <- c(cv.glm(data = mydf, glm(gdp ~ 1 + x), K = 10)$delta[1],
        cv.glm(data = mydf, glm(gdp ~ 1 + x + x^2), K = 10)$delta[1],
        cv.glm(data = mydf, glm(gdp ~ 1 + x + x^2 + x^3), K = 10)$delta[1],
        cv.glm(data = mydf, glm(gdp ~ 1 + x + x^2 + x^3 + x^4), K = 10)$delta[1],
        cv.glm(data = mydf, glm(gdp ~ 1 + x + x^2 + x^3 + x^4 + x^5), K = 10)$delta[1],
        cv.glm(data = mydf, glm(gdp ~ 1 + x + x^2 + x^3 + x^4 + x^5 + x^6), K = 10)$delta[1])
print(d1)
## [1] 2.149647e+13 2.253999e+13 2.182175e+13 2.177170e+13 2.198675e+13
## [6] 2.145754e+13
# Here it chooses 1 or 6.

Query: why does it choose 1? Is this just round-off noise, or sampling noise created by cross-validation when it draws the K folds? Or is it due to the ill-conditioned model matrix? (See the note below on what ^ means inside a model formula.)

Best Regards,
Ashim.
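A minimal sketch of the stepwise order-selection procedure described above (assuming the series is in gdp, with x <- 1:93 as elsewhere in this thread; the 5% cutoff is an arbitrary choice):

------- snip -------

## Assumes gdp holds the 93-point series discussed in this message.
x <- 1:93
for (deg in 1:6) {
    fit <- lm(gdp ~ poly(x, deg))
    ## p-value of the highest-order orthogonal term
    pval <- summary(fit)$coefficients[deg + 1, "Pr(>|t|)"]
    cat("degree", deg, "- p-value of top term:", signif(pval, 3), "\n")
    if (pval > 0.05) break   # last term insignificant: stop
}

------- snip -------

Because the columns of poly(x, deg) are orthonormal and orthogonal to the intercept, the estimates of the lower-order coefficients do not change as the degree increases; this is what makes the stepwise test clean, and it would not hold with raw powers.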
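One note on the d1 code above, which would explain the puzzling result: inside an R model formula, ^ is the formula crossing operator, not arithmetic exponentiation, so gdp ~ 1 + x + x^2 expands to gdp ~ x. All six models in d1 are therefore the same straight-line fit, and the spread among the six estimates is just the fold-to-fold sampling noise of cv.glm(). To fit raw powers, protect them with I() (or use poly(x, k, raw = TRUE)), as in this sketch (again assuming gdp and x as above):

------- snip -------

## x^2, x^3, ... inside a formula collapse to x, so this model
## has only 2 coefficients, the same as gdp ~ x:
length(coef(glm(gdp ~ 1 + x + x^2 + x^3)))
## I() protects ^ from the formula parser, giving a genuine raw quadratic:
library(boot)
mydf <- data.frame(gdp = gdp, x = x)
cv.glm(data = mydf, glm(gdp ~ x + I(x^2), data = mydf), K = 10)$delta[1]

------- snip -------

This is also consistent with the numbers shown: every entry of d1 is close to d[1], the CV estimate for the plain linear fit, whereas with poly(x, k) in d the models genuinely differ and the improvement from degree 1 to degree 4 or 5 is clear.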
On Wed, Nov 27, 2019 at 10:41 PM Fox, John <j...@mcmaster.ca> wrote:

> Dear Ashim,
>
> Orthogonal polynomials are used because they tend to produce more accurate
> numerical computations, not because their coefficients are interpretable,
> so I wonder why you're interested in the coefficients.
>
> The regressors produced are orthogonal to the constant regressor and are
> orthogonal to each other (and in fact are orthonormal), as it's simple to
> demonstrate:
>
> ------- snip -------
>
> > x <- 1:93
> > y <- 1 + x + x^2 + x^3 + x^4 + rnorm(93)
> > (m <- lm(y ~ poly(x, 4)))
>
> Call:
> lm(formula = y ~ poly(x, 4))
>
> Coefficients:
> (Intercept)  poly(x, 4)1  poly(x, 4)2  poly(x, 4)3  poly(x, 4)4
>    15574516    172715069     94769949     27683528      3429259
>
> > X <- model.matrix(m)
> > head(X)
>   (Intercept) poly(x, 4)1 poly(x, 4)2 poly(x, 4)3 poly(x, 4)4
> 1           1  -0.1776843   0.2245083  -0.2572066  0.27935949
> 2           1  -0.1738216   0.2098665  -0.2236579  0.21862917
> 3           1  -0.1699589   0.1955464  -0.1919525  0.16390514
> 4           1  -0.1660962   0.1815482  -0.1620496  0.11487597
> 5           1  -0.1622335   0.1678717  -0.1339080  0.07123722
> 6           1  -0.1583708   0.1545171  -0.1074869  0.03269145
>
> > zapsmall(crossprod(X))  # X'X
>             (Intercept) poly(x, 4)1 poly(x, 4)2 poly(x, 4)3 poly(x, 4)4
> (Intercept)          93           0           0           0           0
> poly(x, 4)1           0           1           0           0           0
> poly(x, 4)2           0           0           1           0           0
> poly(x, 4)3           0           0           0           1           0
> poly(x, 4)4           0           0           0           0           1
>
> ------- snip -------
>
> If for some not immediately obvious reason you're interested in the
> regression coefficients, why not just use a "raw" polynomial:
>
> ------- snip -------
>
> > (m1 <- lm(y ~ poly(x, 4, raw=TRUE)))
>
> Call:
> lm(formula = y ~ poly(x, 4, raw = TRUE))
>
> Coefficients:
>             (Intercept)  poly(x, 4, raw = TRUE)1  poly(x, 4, raw = TRUE)2
>                  1.5640                   0.8985                   1.0037
> poly(x, 4, raw = TRUE)3  poly(x, 4, raw = TRUE)4
>                  1.0000                   1.0000
>
> ------- snip -------
>
> These coefficients are simply interpretable, but the model matrix is more
> poorly conditioned:
>
> ------- snip -------
>
> > X1 <- model.matrix(m1)
> > head(X1)
>   (Intercept) poly(x, 4, raw = TRUE)1 poly(x, 4, raw = TRUE)2 poly(x, 4, raw = TRUE)3 poly(x, 4, raw = TRUE)4
> 1           1                       1                       1                       1                       1
> 2           1                       2                       4                       8                      16
> 3           1                       3                       9                      27                      81
> 4           1                       4                      16                      64                     256
> 5           1                       5                      25                     125                     625
> 6           1                       6                      36                     216                    1296
>
> > round(cor(X1[, -1]), 2)
>                         poly(x, 4, raw = TRUE)1 poly(x, 4, raw = TRUE)2 poly(x, 4, raw = TRUE)3 poly(x, 4, raw = TRUE)4
> poly(x, 4, raw = TRUE)1                    1.00                    0.97                    0.92                    0.87
> poly(x, 4, raw = TRUE)2                    0.97                    1.00                    0.99                    0.96
> poly(x, 4, raw = TRUE)3                    0.92                    0.99                    1.00                    0.99
> poly(x, 4, raw = TRUE)4                    0.87                    0.96                    0.99                    1.00
>
> ------- snip -------
>
> The two parametrizations are equivalent, however, in that they represent
> the same regression surface and so, e.g., produce the same fitted values:
>
> ------- snip -------
>
> > all.equal(fitted(m), fitted(m1))
> [1] TRUE
>
> ------- snip -------
>
> Because one is usually not interested in the individual coefficients of a
> polynomial, there usually isn't a reason to prefer one parametrization to
> the other on the grounds of interpretability, so why do you need to
> interpret the regression equation?
>
> I hope this helps,
>  John
>
> -----------------------------
> John Fox, Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: http://socserv.mcmaster.ca/jfox
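For reference, a minimal sketch of what poly() computes, following the Stack Overflow answer linked later in this thread (an illustration, not the exact source of poly()): poly() centres x, forms the Vandermonde matrix of raw powers, and orthonormalizes its columns via a QR decomposition. The resulting basis is orthonormal with respect to the counting measure on the observed design points, and it can be recovered (up to column signs) from qr():

------- snip -------

x <- 1:93
P <- unclass(poly(x, 4))             # R's orthogonal polynomial basis
V <- outer(x - mean(x), 0:4, "^")    # centred Vandermonde matrix, degrees 0..4
Q <- qr.Q(qr(V))[, -1]               # orthonormal columns; drop the constant
max(abs(abs(Q) - abs(P)))            # ~ 1e-16: equal up to column signs

------- snip -------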
>
> > On Nov 27, 2019, at 10:17 AM, Ashim Kapoor <ashimkap...@gmail.com> wrote:
> >
> > Dear Petr,
> >
> > Many thanks for the quick response.
> >
> > I also read this:
> > https://en.wikipedia.org/wiki/Discrete_orthogonal_polynomials
> >
> > I also read in ?poly:
> >
> >     The orthogonal polynomial is summarized by the coefficients, which
> >     can be used to evaluate it via the three-term recursion given in
> >     Kennedy & Gentle (1980, pp. 343-4), and used in the 'predict' part
> >     of the code.
> >
> > I don't have access to the book mentioned.
> >
> > Out of curiosity, what is the name of the discrete orthogonal
> > polynomial used by R? What discrete measure is it orthogonal with
> > respect to?
> >
> > Many thanks,
> > Ashim
> >
> > On Wed, Nov 27, 2019 at 6:11 PM PIKAL Petr <petr.pi...@precheza.cz> wrote:
> >
> >> You could get an answer quickly by searching the net:
> >>
> >> https://stackoverflow.com/questions/39031172/how-poly-generates-orthogonal-polynomials-how-to-understand-the-coefs-ret/39051154#39051154
> >>
> >> Cheers
> >> Petr
> >>
> >>> -----Original Message-----
> >>> From: R-help <r-help-boun...@r-project.org> On Behalf Of Ashim Kapoor
> >>> Sent: Wednesday, November 27, 2019 12:55 PM
> >>> To: R Help <r-help@r-project.org>
> >>> Subject: [R] Orthogonal polynomials used by R
> >>>
> >>> Dear All,
> >>>
> >>> I have created a time trend by doing x <- 1:93 because I have a time
> >>> series with 93 data points. Next I did:
> >>>
> >>> y <- lm(series ~ poly(x, 4))$residuals
> >>>
> >>> to detrend the series.
> >>>
> >>> I chose 4 as the order of my polynomial using cross-validation and by
> >>> checking for the absence of trend in the residuals, so I think I have
> >>> not overfit this series.
> >>>
> >>> I wish to document the formula of poly(x, 4). I am not able to find it
> >>> in ?poly.
> >>>
> >>> Can someone please tell me what the formula for the orthogonal
> >>> polynomial used by R is?
> >>>
> >>> Thank you,
> >>> Ashim

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.