[R] log-transformed linear regression

2010-11-10 Thread servet cizmeli

Hello,

I have a basic question; sorry if the answer is obvious.

I have the following data file :
http://ekumen.homelinux.net/mydata.txt

I need to model Y~X-1 (simple linear regression through the origin) with
these data :

load(file = "mydata.txt")   # restores the saved object k
X <- k[, 1]
Y <- k[, 2]

aa <- lm(Y ~ X - 1)
dev.new()
plot(X, Y, log = "xy")
abline(aa, untf = TRUE)
abline(a = 0, b = 0.0235, col = "red", untf = TRUE)
abline(a = 0, b = 0.031, col = "green", untf = TRUE)

Other people did the same kind of analysis with their data and found
regression coefficients of 0.0235 (red line) and 0.031 (green line).

Regression with my own data, though, yields a slope of 0.0458 (black
line), which is too high. Clearly my regression is too strongly
influenced by the single point with high values (X > 100). I would not
like to discard this point, though, because I know the measurement is
correct; I would just like to give it less weight...
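Perhaps something like weighted least squares would keep the point while
reducing its pull; here is a sketch with toy numbers in place of my data
(the 1/X^2 weighting is only a guess):

```r
set.seed(1)
# toy stand-in for mydata.txt: a linear trend plus one large-X point
X <- c(0.5, 1, 2, 5, 10, 150)
Y <- 0.03 * X + rnorm(length(X), sd = 0.01 * X)

# weights shrink the influence of large-X points; 1/X^2 is one guess,
# and other choices (1/X, 1/Y^2, ...) will change the answer
ww <- lm(Y ~ X - 1, weights = 1 / X^2)
coef(ww)["X"]
```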

When I log-transform the X and Y data, I obtain:

dev.new()
plot(log10(X), log10(Y))
abline(v = 0, h = 0, col = "cyan")
bb <- lm(log10(Y) ~ log10(X))
abline(bb, col = "blue")
bb

I am happy with this regression, but the slope is now in the log-log
domain. I have to convert it back so that I can obtain a number
comparable with the literature values (0.0235 and 0.031). How do I do
that? I can't force the second regression through the origin, as the
log-transformed data no longer passes through the origin.

At first it seemed like an easy problem, but I am at a loss :o((
Thanks a lot for your kind help,
servet

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] log-transformed linear regression

2010-11-10 Thread servet cizmeli

Dear List,

I would like to take another chance and see if someone has anything to
say about my last post...


bump

servet




Re: [R] log-transformed linear regression

2010-11-11 Thread Mike Marchywka

> Dear List,
>
> I would like to take another chance and see if there if someone has
> anything to say to my last post...

ok, I took a look at it to help myself re-learn some R, so I welcome
better alternatives from other R users on the comments below.

If I expand the scale down where you have lots of points, the picture
is something like a plateau for x < 1 and then a general noisy upward
trend; the slope you get for the aggregate blob of data probably
depends on how your sampling is weighted. I wouldn't worry about the
one point at large X in isolation quite yet, but you may want to
consider a piecewise-linear fit or some other way to handle the low-X
data. You may just want to spend more time staring at pictures before
churning out p-values LOL.

To be clear about your log issues: lm will fit a linear model to
whatever you give it, and of course y = a*x means something different
from log(y) = a*log(x). Overall, you need to decide what to do with the
models depending on what you think is a candidate for being real;
indeed, your points at large X and Y may be outside the linear range of
this "system", but who knows.

If you just fit a bunch of data to a line, the slope of course depends
on which part of that bunch you have sampled. To explore sensitivity,
try things like the following, with ".0" replaced by whatever threshold
you want. It is easy to automate: put this in a loop and plot
histograms of the resultant slopes, etc.

> sel=runif(length(X))>.0
> aasel=lm(Y[sel]~X[sel])
> summary(aasel)

Call:
lm(formula = Y[sel] ~ X[sel])

Residuals:
      Min        1Q    Median        3Q       Max
-0.221957 -0.004207  0.004055  0.013395  0.232362

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0256642  0.0033254  -7.718 1.54e-12 ***
X[sel]   0.0463599  0.0003146 147.365  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0399 on 150 degrees of freedom
Multiple R-squared: 0.9931, Adjusted R-squared: 0.9931
F-statistic: 2.172e+04 on 1 and 150 DF,  p-value: < 2.2e-16

if you increase ".0" to something like .5 or so and keep repeating
this, you get some idea of what can happen; make changes for the
various models and then decide what you are trying to do, etc.



> sel=runif(length(X))>.9
> aasel=lm(Y[sel]~X[sel])
> summary(aasel)

Call:
lm(formula = Y[sel] ~ X[sel])

Residuals:
      Min        1Q    Median        3Q       Max
-0.199943 -0.002839  0.010626  0.019608  0.149099

Coefficients:
 Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.035546   0.013741  -2.587   0.0162 *
X[sel]   0.052033   0.003401  15.301 7.04e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05878 on 24 degrees of freedom
Multiple R-squared: 0.907,  Adjusted R-squared: 0.9031
F-statistic: 234.1 on 1 and 24 DF,  p-value: 7.037e-14
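The loop-and-histogram idea above might look like the following sketch;
the toy X and Y here only stand in for mydata.txt, and the 0.5
threshold is arbitrary:

```r
set.seed(42)
# toy stand-in for the real data: a noisy power-law cloud plus one large-X point
X <- c(runif(150, 0.1, 5), 150)
Y <- 0.03 * X^1.1 * exp(rnorm(151, sd = 0.1))

# refit on random subsets and collect the fitted slopes
slopes <- replicate(500, {
  sel <- runif(length(X)) > 0.5      # keep roughly half the points each time
  coef(lm(Y[sel] ~ X[sel]))[2]
})
hist(slopes, main = "Slope under random subsampling")
```

The spread of `slopes` gives a feel for how sensitive the fitted slope
is to which points happen to be sampled.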



Re: [R] log-transformed linear regression

2010-11-11 Thread Mike Marchywka




I had a few more thoughts, but briefly: it almost looks like
exponential at low X and linear at higher X may be something to test.
Anyway, I'd be hard pressed to see how a linear fit, which you are
happy to turn into a log-log fit, could give you "a" number of any
relevance except maybe over a few limited ranges.

> plot(X,log(Y))
> sel=(X>2)
> plot(X[sel],log(Y[sel]))
> sel=(X<2)
> plot(X[sel],log(Y[sel]))

I had some longer message ideas, but you can keep running models on
selected subsets of your data and see whether that one point is your
real problem. And, above all, I wouldn't try to find an error measure
that just helps you match prior literature numbers found via a
different method...




Re: [R] log-transformed linear regression

2010-11-11 Thread Matt Shotwell
Servet,

These data do look linear in log space. Fortunately, the model

log10(y) = a + b * log10(x)

does have intercept zero in linear space. To see this, consider

log10(y) = a + b * log10(x)
       y = 10^(a + b * log10(x))
       y = 10^a * 10^(b * log10(x))
       y = 10^a * 10^(log10(x^b))
       y = 10^a * x^b

Hence, y = 0 when x = 0 (provided b > 0). The code below estimates a and b.
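A sketch of that estimate, reusing the poster's own log10 regression
(toy data stand in for mydata.txt here, so the fitted a and b are
illustrative only):

```r
set.seed(1)
X <- c(runif(150, 0.1, 5), 150)    # toy stand-in for mydata.txt
Y <- 0.03 * X^1.1 * exp(rnorm(151, sd = 0.1))

bb <- lm(log10(Y) ~ log10(X))
a <- unname(coef(bb)[1])   # intercept in log10 space
b <- unname(coef(bb)[2])   # slope in log10 space (the power-law exponent)
c(a = a, b = b)
```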

Of course,

y = 10^a * x^b

is not a line, so we can't directly compare slopes. However, in the
region of your data, the estimated mean is _nearly_ linear. In fact,
you could consider a linear approximation, say at the median of your x
values, which is 0.958; for simplicity, let's just say it's 1.0. The
linear approximation (first-order Taylor expansion) of

y = 10^a * x^b

at x = 1 is

y = 10^a + 10^a * b * (x - 1)
y = 10^a * (1 - b) + 10^a * b * x

So, the slope of the linear approximation is 10^a * b, and the intercept
is 10^a * (1 - b). Taking a and b from the analysis below, the
approximate intercept is -0.00442, and slope 0.22650. You could argue
that these values are consistent with the literature, but that the log
linear model is more appropriate for these data. You could even
construct a bootstrap confidence interval for the approximate slope.
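A sketch of that conversion (the a and b values passed in here are made
up for illustration; substitute the ones from your fit):

```r
# linearize y = 10^a * x^b around x0 (first-order Taylor expansion);
# at x0 = 1 this gives slope 10^a * b and intercept 10^a * (1 - b)
approx_line <- function(a, b, x0 = 1) {
  slope <- 10^a * b * x0^(b - 1)
  intercept <- 10^a * x0^b - slope * x0
  c(intercept = intercept, slope = slope)
}
approx_line(a = -1.5, b = 1.1)   # illustrative values only
```

A bootstrap confidence interval for the approximate slope then amounts
to re-estimating a and b on resampled data and calling approx_line()
each time.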

-Matt


-- 
Matthew S. Shotwell
Graduate Student 
Division of Biostatistics and Epidemiology
Medical University of South Carolina
