Dear Ashim

As John said your two examples give the same model to within rounding error so it is not clear what you see the problem as being. You can always remove some of the correlation by subtracting out a large constant from x before you use poly() on it.

Michael

On 28/11/2019 16:02, Ashim Kapoor wrote:
On Thu, Nov 28, 2019 at 7:38 PM Fox, John <j...@mcmaster.ca> wrote:

Dear Ashim,

I'm afraid that much of what you say here is confused.

First, because poly(x) and poly(x, raw=TRUE) produce the same fitted
values (as I previously explained), they also produce the same residuals,
and consequently the same CV criteria. From the point of view of CV,
there's therefore no reason to prefer orthogonal polynomials. And you still
don't explain why you want to interpret the coefficients of the polynomial.


The trend in the variable that I am trying to create an ARIMA model for is
given by poly(x,4). That is why I wished to know what these polynomials
look like.

I used  :

trend <- predict(lm(gdp~poly(x,4)),newdata = data.frame(
x=94:103),interval="confidence")

and I was able to (numerically) extrapolate the poly(x,4) trend, although,
I think it would be interesting to know what polynomials I was dealing with
in this case. Just some intuition as to if the linear / quadratic / cubic /
fourth order polynomial trend is dominating. I don't know how I would
interpret them, but it would be fun to know.

Please allow me to show you a trick. I read this on the internet, here :-

https://www.datasciencecentral.com/profiles/blogs/deep-dive-into-polynomial-regression-and-overfitting

Please see the LAST comment by Scott Stelljes where he suggests using an
orthogonal polynomial basis. He does not elaborate but leaves the reader to
work out the details.

Here is what I think of this. Take a big number say 20 and take a variable
in which we are trying to find the order of the polynomial in the trend.
Like this :-

summary(lm(gdp ~ poly(x,20)))

Call:
lm(formula = gdp ~ poly(x, 20))

Residuals:
      Min       1Q   Median       3Q      Max
-1235661  -367798   -80453   240360  1450906

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    17601482      66934 262.968  < 2e-16 ***
poly(x, 20)1  125679081     645487 194.704  < 2e-16 ***
poly(x, 20)2   43108747     645487  66.785  < 2e-16 ***
poly(x, 20)3    3605839     645487   5.586 3.89e-07 ***
poly(x, 20)4   -2977277     645487  -4.612 1.69e-05 ***
poly(x, 20)5    1085732     645487   1.682   0.0969 .
poly(x, 20)6    1124125     645487   1.742   0.0859 .
poly(x, 20)7    -108676     645487  -0.168   0.8668
poly(x, 20)8    -976915     645487  -1.513   0.1345
poly(x, 20)9   -1635444     645487  -2.534   0.0135 *
poly(x, 20)10   -715019     645487  -1.108   0.2717
poly(x, 20)11    347102     645487   0.538   0.5924
poly(x, 20)12   -176728     645487  -0.274   0.7850
poly(x, 20)13   -634151     645487  -0.982   0.3292
poly(x, 20)14   -537725     645487  -0.833   0.4076
poly(x, 20)15    -58674     645487  -0.091   0.9278
poly(x, 20)16    -67030     645487  -0.104   0.9176
poly(x, 20)17   -809443     645487  -1.254   0.2139
poly(x, 20)18   -668879     645487  -1.036   0.3036
poly(x, 20)19   -302384     645487  -0.468   0.6409
poly(x, 20)20    359134     645487   0.556   0.5797
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 645500 on 72 degrees of freedom
Multiple R-squared:  0.9983, Adjusted R-squared:  0.9978
F-statistic:  2122 on 20 and 72 DF,  p-value: < 2.2e-16




The CV estimate of the trend is 4. I am not saying this method is perfect,
but look above this method also suggests n=4.

I CANNOT do this with raw polynomials, since they are correlated and
JOINTLY in the presence of others they may not be significant, please see
below.

summary(lm(gdp ~ poly(x,20,raw=T)))

Call:
lm(formula = gdp ~ poly(x, 20, raw = T))

Residuals:
      Min       1Q   Median       3Q      Max
-1286007  -372221   -81320   248510  1589130

Coefficients: (4 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             2.067e+06  2.649e+06   0.780    0.438
poly(x, 20, raw = T)1   1.633e+06  3.556e+06   0.459    0.647
poly(x, 20, raw = T)2  -7.601e+05  1.679e+06  -0.453    0.652
poly(x, 20, raw = T)3   1.861e+05  3.962e+05   0.470    0.640
poly(x, 20, raw = T)4  -2.634e+04  5.480e+04  -0.481    0.632
poly(x, 20, raw = T)5   2.370e+03  4.854e+03   0.488    0.627
poly(x, 20, raw = T)6  -1.434e+02  2.906e+02  -0.493    0.623
poly(x, 20, raw = T)7   6.022e+00  1.213e+01   0.496    0.621
poly(x, 20, raw = T)8  -1.784e-01  3.587e-01  -0.497    0.620
poly(x, 20, raw = T)9   3.727e-03  7.503e-03   0.497    0.621
poly(x, 20, raw = T)10 -5.373e-05  1.086e-04  -0.495    0.622
poly(x, 20, raw = T)11  5.016e-07  1.018e-06   0.493    0.624
poly(x, 20, raw = T)12 -2.483e-09  5.069e-09  -0.490    0.626
poly(x, 20, raw = T)13         NA         NA      NA       NA
poly(x, 20, raw = T)14  5.656e-14  1.167e-13   0.485    0.629
poly(x, 20, raw = T)15         NA         NA      NA       NA
poly(x, 20, raw = T)16 -1.933e-18  4.011e-18  -0.482    0.631
poly(x, 20, raw = T)17         NA         NA      NA       NA
poly(x, 20, raw = T)18  5.181e-23  1.076e-22   0.482    0.631
poly(x, 20, raw = T)19         NA         NA      NA       NA
poly(x, 20, raw = T)20 -7.173e-28  1.480e-27  -0.485    0.629

Residual standard error: 641000 on 76 degrees of freedom
Multiple R-squared:  0.9982, Adjusted R-squared:  0.9979
F-statistic:  2690 on 16 and 76 DF,  p-value: < 2.2e-16



Note,however, once the orthogonal polynomials have suggested a number, 4 in
this case, I can do this :-

  summary(lm(gdp ~ poly(x,4,raw=T)))

Call:
lm(formula = gdp ~ poly(x, 4, raw = T))

Residuals:
      Min       1Q   Median       3Q      Max
-1278673  -424315   -22357   310977  1731813

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)           3.022e+06  3.676e+05   8.220 1.64e-12 ***
poly(x, 4, raw = T)1  1.741e+05  5.357e+04   3.249  0.00164 **
poly(x, 4, raw = T)2 -6.434e+03  2.300e+03  -2.797  0.00633 **
poly(x, 4, raw = T)3  1.878e+02  3.667e+01   5.123 1.76e-06 ***
poly(x, 4, raw = T)4 -8.682e-01  1.935e-01  -4.486 2.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 663700 on 88 degrees of freedom
Multiple R-squared:  0.9978, Adjusted R-squared:  0.9977
F-statistic: 1.003e+04 on 4 and 88 DF,  p-value: < 2.2e-16



Although due to correlations they may not be significant jointly, but in
this case all 4 powers come out significant.


Second, the model formula gdp~1+x+x^2 and other similar formulas in your
message don't do what you think. Like + and *, the ^ operator has special
meaning on the right-hand side of an R model formula. See ?Formula and
perhaps read something about statistical models in R. For example:

x <- 1:93
y <- 1 + x + x^2 + x^3 + x^4 + rnorm(93)
(m <- lm(y ~ x + x^2))

Call:
lm(formula = y ~ x + x^2)

Coefficients:
(Intercept)            x
   -15781393       667147

While gpp ~ x + I(x^2) would work, a better way to fit a raw quadratic is
as gdp ~ poly(x, 2, raw=TRUE), as I suggested in my earlier message.


My bad. Yes, I have some idea of the Wilkinson-Rogers notation. I have seen
it in books, although it slipped my mind that I had to use I( ).


Finally, as to what you should do, I generally try to avoid statistical
consulting by email. If you can find competent statistical help locally,
such as at a nearby university, I'd recommend talking to someone about the
purpose of your research and the nature of your data. If that's not
possible, then others have suggested where you might find help, but to get
useful advice you'll have to provide much more information about your
research.


My original query was about the polynomials used by R which I think is ON
topic. My apologies that this query turned into a statistics query.


Best,
  John

   -----------------------------
   John Fox, Professor Emeritus
   McMaster University
   Hamilton, Ontario, Canada
   Web: http::/socserv.mcmaster.ca/jfox

On Nov 28, 2019, at 12:46 AM, Ashim Kapoor <ashimkap...@gmail.com>
wrote:

Dear Peter and John,

Many thanks for your prompt replies.

Here is what I was trying to do.  I was trying to build a statistical
model of a given time series using Box Jenkins methodology. The series has
93 data points. Before I analyse the ACF and PACF, I am required to
de-trend the series. The series seems to have an upward trend. I wanted to
find out what order polynomial should I fit the series
without overfitting.  For this I want to use orthogonal polynomials(I
think someone on the internet was talking about preventing overfitting by
using orthogonal polynomials) . This seems to me as a poor man's cross
validation.

So my plan is to keep increasing the degree of the orthogonal
polynomials till the coefficient of the last orthogonal polynomial becomes
insignificant.

Note : If I do NOT use orthogonal polynomials, I will overfit the data
set and I don't think that is a good way to detect the true order of the
polynomial.

Also now that I have detrended the series and built an ARIMA model of
the residuals, now I want to forecast. For this I need to use the original
polynomials and their coefficients.

I hope I was clear and that my methodology is ok.

I have another query here :-

Note : If I used cross-validation to determine the order of the
polynomial, I don't get a clear answer.

See here :-
library(boot)
mydf = data.frame(cbind(gdp,x))
d<-(c(
cv.glm(data = mydf,glm(gdp~x),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,2)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,3)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,4)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,5)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,6)),K=10)$delta[1]))
print(d)
## [1] 2.178574e+13 7.303031e+11 5.994783e+11 4.943586e+11 4.596648e+11
## [6] 4.980159e+11

# Here it chooses 5. (but 4 and 5 are kind of similar).


d1 <- (c(
cv.glm(data = mydf,glm(gdp~1+x),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3+x^4),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3+x^4+x^5),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3+x^4+x^5+x^6),K=10)$delta[1]))

print(d1)
## [1] 2.149647e+13 2.253999e+13 2.182175e+13 2.177170e+13 2.198675e+13
## [6] 2.145754e+13

# here it chooses 1 or 6

Query : Why does it choose 1? Notice : Is this just round off noise /
noise due to sampling error created by Cross Validation when it creates the
K folds? Is this due to the ill conditioned model matrix?

Best Regards,
Ashim.





On Wed, Nov 27, 2019 at 10:41 PM Fox, John <j...@mcmaster.ca> wrote:
Dear Ashim,

Orthogonal polynomials are used because they tend to produce more
accurate numerical computations, not because their coefficients are
interpretable, so I wonder why you're interested in the coefficients.

The regressors produced are orthogonal to the constant regressor and are
orthogonal to each other (and in fact are orthonormal), as it's simple to
demonstrate:

------- snip -------

x <- 1:93
y <- 1 + x + x^2 + x^3 + x^4 + rnorm(93)
(m <- lm(y ~ poly(x, 4)))

Call:
lm(formula = y ~ poly(x, 4))

Coefficients:
(Intercept)  poly(x, 4)1  poly(x, 4)2  poly(x, 4)3  poly(x, 4)4
    15574516    172715069     94769949     27683528      3429259

X <- model.matrix(m)
head(X)
   (Intercept) poly(x, 4)1 poly(x, 4)2 poly(x, 4)3 poly(x, 4)4
1           1  -0.1776843   0.2245083  -0.2572066  0.27935949
2           1  -0.1738216   0.2098665  -0.2236579  0.21862917
3           1  -0.1699589   0.1955464  -0.1919525  0.16390514
4           1  -0.1660962   0.1815482  -0.1620496  0.11487597
5           1  -0.1622335   0.1678717  -0.1339080  0.07123722
6           1  -0.1583708   0.1545171  -0.1074869  0.03269145

zapsmall(crossprod(X))# X'X
             (Intercept) poly(x, 4)1 poly(x, 4)2 poly(x, 4)3 poly(x, 4)4
(Intercept)          93           0           0           0           0
poly(x, 4)1           0           1           0           0           0
poly(x, 4)2           0           0           1           0           0
poly(x, 4)3           0           0           0           1           0
poly(x, 4)4           0           0           0           0           1

------- snip -------

If for some not immediately obvious reason you're interested in the
regression coefficients, why not just use a "raw" polynomial:

------- snip -------

(m1 <- lm(y ~ poly(x, 4, raw=TRUE)))

Call:
lm(formula = y ~ poly(x, 4, raw = TRUE))

Coefficients:
             (Intercept)  poly(x, 4, raw = TRUE)1  poly(x, 4, raw =
TRUE)2  poly(x, 4, raw = TRUE)3
                  1.5640                   0.8985
  1.0037                   1.0000
poly(x, 4, raw = TRUE)4
                  1.0000

------- snip -------

These coefficients are simply interpretable but the model matrix is more
poorly conditioned:

------- snip -------

head(X1)
   (Intercept) poly(x, 4, raw = TRUE)1 poly(x, 4, raw = TRUE)2 poly(x, 4,
raw = TRUE)3
1           1                       1                       1
            1
2           1                       2                       4
            8
3           1                       3                       9
           27
4           1                       4                      16
           64
5           1                       5                      25
          125
6           1                       6                      36
          216
   poly(x, 4, raw = TRUE)4
1                       1
2                      16
3                      81
4                     256
5                     625
6                    1296
round(cor(X1[, -1]), 2)
                         poly(x, 4, raw = TRUE)1 poly(x, 4, raw = TRUE)2
poly(x, 4, raw = TRUE)3
poly(x, 4, raw = TRUE)1                    1.00                    0.97
                   0.92
poly(x, 4, raw = TRUE)2                    0.97                    1.00
                   0.99
poly(x, 4, raw = TRUE)3                    0.92                    0.99
                   1.00
poly(x, 4, raw = TRUE)4                    0.87                    0.96
                   0.99
                         poly(x, 4, raw = TRUE)4
poly(x, 4, raw = TRUE)1                    0.87
poly(x, 4, raw = TRUE)2                    0.96
poly(x, 4, raw = TRUE)3                    0.99
poly(x, 4, raw = TRUE)4                    1.00

------- snip -------

The two parametrizations are equivalent, however, in that they represent
the same regression surface, and so, e.g., produce the same fitted values:

------- snip -------

all.equal(fitted(m), fitted(m1))
[1] TRUE

------- snip -------

Because one is usually not interested in the individual coefficients of
a polynomial there usually isn't a reason to prefer one parametrization to
the other on the grounds of interpretability, so why do you need to
interpret the regression equation?

I hope this helps,
  John

   -----------------------------
   John Fox, Professor Emeritus
   McMaster University
   Hamilton, Ontario, Canada
   Web: http::/socserv.mcmaster.ca/jfox

On Nov 27, 2019, at 10:17 AM, Ashim Kapoor <ashimkap...@gmail.com>
wrote:

Dear Petr,

Many thanks for the quick response.

I also read this:-
https://en.wikipedia.org/wiki/Discrete_orthogonal_polynomials

Also I read  in ?poly:-
     The orthogonal polynomial is summarized by the coefficients, which
     can be used to evaluate it via the three-term recursion given in
     Kennedy & Gentle (1980, pp. 343-4), and used in the ‘predict’ part
     of the code.

I don't have access to the mentioned book.

Out of curiosity, what is the name of the discrete orthogonal
polynomial
used by R ?
What discrete measure is it orthogonal with respect to ?

Many thanks,
Ashim




On Wed, Nov 27, 2019 at 6:11 PM PIKAL Petr <petr.pi...@precheza.cz>
wrote:

You could get answer quickly by searching net.



https://stackoverflow.com/questions/39031172/how-poly-generates-orthogonal-p
olynomials-how-to-understand-the-coefs-ret/39051154#39051154
<
https://stackoverflow.com/questions/39031172/how-poly-generates-orthogonal-polynomials-how-to-understand-the-coefs-ret/39051154#39051154


Cheers
Petr

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Ashim
Kapoor
Sent: Wednesday, November 27, 2019 12:55 PM
To: R Help <r-help@r-project.org>
Subject: [R] Orthogonal polynomials used by R

Dear All,

I have created a time trend by doing x<-1:93 because I have a time
series
with 93 data points. Next I did :-

y = lm(series ~ poly(x,4))$residuals

to detrend series.

I choose this 4 as the order of my polynomial using cross validation/
checking the absence of trend in the residuals so I think I have not
overfit
this series.

I wish to document the formula of poly(x,4). I am not able to find
it in
?poly

Can someone please tell me what the formula for the orthogonal
polynomial used by R is ?

Thank you,
Ashim

      [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html
and provide commented, minimal, self-contained, reproducible code.


       [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Michael
http://www.dewey.myzen.co.uk/home.html

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to