[R] Goodness of Fit for Word Frequency Counts

2010-03-25 Thread Thiemo Fetzer

Dear mailing list,

Sorry to bother you, but maybe you can help me out. I have been searching
and searching for an appropriate test.

I have a huge dataset of loan requests with data at the portfolio level; the
average portfolio size is 200 loans. I want to test whether portfolios are
randomly drawn. The problem is that my data are rather qualitative: I want to
characterize whether loans are randomly selected using word counts.

For each loan, I have a "sector", an "activity" and a "use description". The
"use description" contains about 15 words; the "activity" description is
usually only one or two words.

What I have done so far is to compute the word counts in the overall
portfolio of 110,000 loans. From these, knowing the size of a team portfolio,
I can compute the expected frequency with which certain keywords appear. The
"sector" variable is categorical and can take only 17 values, whereas in the
overall distribution I found 180 different words used as activity
descriptions.

I now want to do a type of "goodness of fit" test to see whether the
portfolios are randomly selected or not. I would expect that certain
portfolios are indeed randomly selected, whereas others aren't.

I ran a chi^2, a Pearson-Tukey, and a G-test of goodness of fit. The problem
is that these tests are usually constructed for categorical data, but if I use
the "activity" word counts, the data need not be categorical. So I am
wondering whether this is still appropriate.

I may have a portfolio of 200 loans in which certain words never appear. In
this case, I am not sure which degrees of freedom to use. Should I use, as
prescribed, 179 degrees of freedom because I have 180 "categories"? But these
aren't real categories...

An example may look as follows; the word is on the left, followed by the
observed and expected word counts:


 word             observed    expected
 --------------------------------------
 food                   54   57.511776
 retail                 46   49.04432
 agriculture            39   36.557732
 services               23   15.867387
 clothing               13   14.126975
 transportation         10    6.5851929
 housing                 3    4.65019
 construction            2    4.3173841
 arts                    5    4.2500955
 manufacturing           1    3.0170768
 health                  2    1.751323
 use                     0    0.55215646
 personal                0    0.25241221
 education               2    0.68743521
 wholesale               0    0.11241227
 entertainment           0    0.32743521
 green                   0    0.42743782

I can have R calculate the chi^2 statistic from this, but should I then use
16 degrees of freedom (17 "categories" minus one)? The problem is that this is
not really categorical data. In this case, do I have to make comparisons on a
word-by-word basis, like a "Bernoulli" test for each word?

I have been looking for other goodness-of-fit tests for this kind of data for
days now, but I can't really find any.

I really appreciate your thoughts,

best Thiemo


---
http://freigeist.devmag.net

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] R book for economists

2009-08-01 Thread Thiemo Fetzer
Dear Group,

I am an economics student starting PhD work in London. In preparation, I
would like to get to know R a little better. For Stata there are tons of
books; can you recommend a book for R?

I have a solid background in econometrics, so it should be more of a how-to
book.

Best regards
Thiemo

---
Thiemo Fetzer, Economist
http://freigeist.devmag.net
http://www.devmag.net



Re: [R] Regression inclusion of variable, effect on coefficients

2008-04-21 Thread Thiemo Fetzer
Hello!

I was thinking again about the possible interaction between x1 and x4.

Theoretically it makes sense that the influence of x4 on y is stronger the
less informative x1 is. It can be argued that the higher x1 is, the less
informative it becomes.

How could I incorporate this relationship in the model? 
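
Would something along these lines be a reasonable starting point? (A minimal
sketch; y, x1, ..., x4 stand for my actual variables and dat for my data
frame.)

## Let the effect of x4 on y vary with the level of x1 via an x1:x4 term.
m3 <- lm(y ~ x1 + x2 + x3 + x4 + x1:x4, data = dat)  # same as y ~ x1*x4 + x2 + x3
summary(m3)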


Thanks a lot for your help in advance,

Thiemo

-Original Message-
From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
Sent: Monday, 21 April 2008 18:54
To: Thiemo Fetzer
Cc: r-help@r-project.org
Subject: Re: [R] Regression inclusion of variable, effect on coefficients

This is not a dumb question. This is a serious problem, and it depends on
what you know or assume about the relationship between x1 and x4. If you
assume a linear interaction, for example, you might want to introduce an
interaction term into the model.

Uwe Ligges


Thiemo Fetzer wrote:
> Hello dear R users!
> 
> I know this question is not strictly R-help, yet, maybe some of the gurus
> in statistics can help me out.
> 
>  
> 
> I have a sample of data all from the same "population". Say my regression
> equation is now this:
> 
>  
> 
> m1 <- lm(y ~ x1 + x2 + x3) 
> 
>  
> 
> I also regress on
> 
>  
> 
> m2 <- lm(y ~ x1 + x2 + x3 + x4)
> 
>  
> 
> The thing is, that I want to study the effect of "information" x4.
> 
>  
> 
> I would hypothesize, that the coefficient estimate for x1 goes down as I
> introduce x4, as x4 conveys some of the information conveyed by x1 (but not
> only). Of course x1 and x4 are correlated, however multicollinearity does
> not appear to be a problem, the variance inflation factors are rather low
> (around 1.5 or so).
> 
>  
> 
> I want to basically study, how the interplay between x1 and x4 is, when
> introducing x4 into the regression equation and whether my hypothesis is
> correct; i.e. that given I consider the information x4, not so much of the
> variation is explained via x1 anymore.
> 
>  
> 
> I observe that introducing x4 into the regression, the coefficient estimate
> for x1 goes down; also the associated p-value becomes bigger; i.e. x1
> becomes comparatively less significant. However, x4 is not significant. Yet,
> the observation is in line with my theoretical argument.
> 
>  
> 
> The question is now simple: how can I work this out?
> 
>  
> 
> I know this is likely a dumb question, but I would really appreciate some
> links or help.
> 
> 
> Regards
> 
> Thiemo
> 
> 



Re: [R] Regression inclusion of variable, effect on coefficients

2008-04-21 Thread Thiemo Fetzer
Hello :)

I am happy to hear that I am not necessarily asking stupid questions.

The thing is that I have data on x1 and x4 for the whole sample. However,
theoretically it is clear that the informational content of x1 is not as high
as that of x4. x4 provides more accurate information to the subjects
participating in the game, as it has been shown experimentally and
theoretically that x1 is biased.

So the experimenters introduced x4 in response to the biased x1. Both remain
available together, however, so the subjects have information on both x1 and
x4.

Theoretically, I argued that the "relative importance" of x1 for y will
decrease in light of the fact that x4 is available, as x4 is more accurate.

With a simple regression, however, I do not find significant relationships.
For x1 it has been shown, both empirically and theoretically, that it has a
positive effect on y. The same should hold for x4.

There is no compelling theoretical argument as to how x1 and x4 interact
mathematically, as they are both measures of the same thing. Yet x4 is more
accurate and contains even more information. It could be any kind of
interaction. They are positively correlated, which is also reasonable.

Could you suggest a simple interaction model with which I could try my luck?
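
For instance, would comparing the main-effects model against one with an
x1:x4 interaction be a sensible first attempt? A minimal sketch (y, x1, ...,
x4 and dat are placeholders for my actual variables and data frame):

## Compare the main-effects model with a model that adds an x1:x4 interaction.
m_main <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
m_int  <- lm(y ~ x1 * x4 + x2 + x3, data = dat)  # adds the x1:x4 term
anova(m_main, m_int)                             # F-test for the interaction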

Thanks a lot

Thiemo 

-Original Message-
From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
Sent: Monday, 21 April 2008 18:54
To: Thiemo Fetzer
Cc: r-help@r-project.org
Subject: Re: [R] Regression inclusion of variable, effect on coefficients

This is not a dumb question. This is a serious problem, and it depends on
what you know or assume about the relationship between x1 and x4. If you
assume a linear interaction, for example, you might want to introduce an
interaction term into the model.

Uwe Ligges


Thiemo Fetzer wrote:
> Hello dear R users!
> 
> I know this question is not strictly R-help, yet, maybe some of the gurus
> in statistics can help me out.
> 
>  
> 
> I have a sample of data all from the same "population". Say my regression
> equation is now this:
> 
>  
> 
> m1 <- lm(y ~ x1 + x2 + x3) 
> 
>  
> 
> I also regress on
> 
>  
> 
> m2 <- lm(y ~ x1 + x2 + x3 + x4)
> 
>  
> 
> The thing is, that I want to study the effect of "information" x4.
> 
>  
> 
> I would hypothesize, that the coefficient estimate for x1 goes down as I
> introduce x4, as x4 conveys some of the information conveyed by x1 (but not
> only). Of course x1 and x4 are correlated, however multicollinearity does
> not appear to be a problem, the variance inflation factors are rather low
> (around 1.5 or so).
> 
>  
> 
> I want to basically study, how the interplay between x1 and x4 is, when
> introducing x4 into the regression equation and whether my hypothesis is
> correct; i.e. that given I consider the information x4, not so much of the
> variation is explained via x1 anymore.
> 
>  
> 
> I observe that introducing x4 into the regression, the coefficient estimate
> for x1 goes down; also the associated p-value becomes bigger; i.e. x1
> becomes comparatively less significant. However, x4 is not significant. Yet,
> the observation is in line with my theoretical argument.
> 
>  
> 
> The question is now simple: how can I work this out?
> 
>  
> 
> I know this is likely a dumb question, but I would really appreciate some
> links or help.
> 
> 
> Regards
> 
> Thiemo
> 
> 



[R] Regression inclusion of variable, effect on coefficients

2008-04-21 Thread Thiemo Fetzer
Hello dear R users!

I know this question is not strictly about R, but maybe some of the gurus in
statistics here can help me out.

 

I have a sample of data all from the same "population". Say my regression
equation is now this:

 

m1 <- lm(y ~ x1 + x2 + x3) 

 

I also regress on

 

m2 <- lm(y ~ x1 + x2 + x3 + x4)

 

The thing is that I want to study the effect of the "information" x4.

 

I would hypothesize that the coefficient estimate for x1 goes down as I
introduce x4, since x4 conveys some (but not all) of the information conveyed
by x1. Of course x1 and x4 are correlated; however, multicollinearity does not
appear to be a problem, as the variance inflation factors are rather low
(around 1.5 or so).

 

Basically, I want to study the interplay between x1 and x4 when x4 is
introduced into the regression equation, and whether my hypothesis is correct,
i.e. that once I account for the information in x4, less of the variation is
explained by x1.

 

I observe that when I introduce x4 into the regression, the coefficient
estimate for x1 goes down; the associated p-value also becomes bigger, i.e. x1
becomes comparatively less significant. However, x4 itself is not significant.
Still, the observation is in line with my theoretical argument.
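
Concretely, what I am doing is roughly the following (a minimal sketch; dat is
a placeholder for my data frame, and I use vif() from the car package for the
variance inflation factors):

library(car)                      # for vif()
m1 <- lm(y ~ x1 + x2 + x3, data = dat)
m2 <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
vif(m2)                           # variance inflation factors (all around 1.5)
anova(m1, m2)                     # does adding x4 improve the fit significantly?
summary(m1)$coefficients["x1", ]  # compare the x1 estimate and p-value
summary(m2)$coefficients["x1", ]  # across the two models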

 

The question is now simple: how can I work this out?

 

I know this is likely a dumb question, but I would really appreciate some
links or help.


Regards

Thiemo





[R] Format regression result summary

2008-04-11 Thread Thiemo Fetzer
Hello to the whole group.

I am a newbie to R, but I have found my way around and think it is a lot
easier to handle than other software packages (far fewer clicks necessary).

However, I have a problem with the formatting of regression result summaries.

The summary function gives something like:

Residuals:
 Min   1Q   Median   3Q  Max 
-0.46743 -0.09772  0.01810  0.11175  0.42252 

Coefficients:
 Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.750367   0.172345  21.761  < 2e-16 ***
Var1  -0.002334   0.009342  -0.250 0.802948
Var2 0.012551   0.005927   2.117 0.035444 *

Var3   0.015380   0.074537   0.206 0.836730
Var3   0.098602   0.026448   3.728 0.000250 ***
...

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.1614 on 202 degrees of freedom
Multiple R-squared: 0.1983, Adjusted R-squared: 0.1506 
F-statistic: 4.163 on 12 and 202 DF,  p-value: 7.759e-06

However, I would like the output to have a format like this:

 Estimate 
(Intercept)  3.750367*** 
(0.172345)
Var1  -0.002334
(0.009342)
Var2 0.012551*
(0.005927)

That is, the standard errors should appear in parentheses below the
estimates, and the * indicating significance should appear next to the
estimates.

I thought this should work by accessing the elements of the summary object,
but once I got started I figured it is quite complicated.

Is there a quick and dirty way?
Basically I want the same print-out as summary(), except without the
t-statistics and p-values, only the significance codes.
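
Something like the following is what I have pieced together so far (a minimal
sketch; fit stands for a fitted lm object, and the model and variable names
are placeholders):

fit <- lm(y ~ x1 + x2, data = mydata)   # placeholder model
cm  <- coef(summary(fit))               # Estimate, Std. Error, t value, Pr(>|t|)
## Significance stars, using the same cutpoints as summary.lm()
stars <- symnum(cm[, 4], corr = FALSE, na = FALSE,
                cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                symbols = c("***", "**", "*", ".", " "))
## Estimate plus stars on one line, standard error in parentheses below
nm  <- format(rownames(cm), width = 12)
out <- paste0(nm, formatC(cm[, 1], digits = 6, format = "f"),
              format(as.character(stars)), "\n",
              format("", width = nchar(nm[1])),
              "(", formatC(cm[, 2], digits = 6, format = "f"), ")")
cat(out, sep = "\n")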

Thanks a lot in advance

Thiemo
