[R] chisq test and fisher exact test

2005-06-22 Thread Weiwei Shi
Hi,
I have a text mining project and am currently working on the feature
generation/selection part.
My plan is to select a set of words or word combinations that
discriminate better than other words between the group ids (2 classes
in this case) for a dataset of 2,000,000 documents.

One approach is "contrast-set association rule mining"; the other is
the chi-squared test or Fisher's exact test.

Here is an example with 3 contingency tables for 3 words (words coded
by number):
> tab[,,1:3]
, , 1

      [,1]    [,2]
[1,] 11266 2151526
[2,]   125   31734

, , 2

      [,1]    [,2]
[1,] 43571 2119221
[2,]    52   31807

, , 3

     [,1]    [,2]
[1,]  427 2162365
[2,]    5   31854


I have some questions on this:
1. What is the rule of thumb for using the chisq test instead of
Fisher's exact test? I have a vague memory that the count in each cell
needs to be over 50 for the chisq test to be used instead of Fisher's
exact test. In the case of word 3, I think I should use the Fisher
test. However, running chisq as below is fine:
> tab[,,3]
     [,1]    [,2]
[1,]  427 2162365
[2,]    5   31854
> chisq.test(tab[,,3])

Pearson's Chi-squared test with Yates' continuity correction

data:  tab[, , 3]
X-squared = 0.0963, df = 1, p-value = 0.7564

but running it on the whole set of words (14240 words) gives the
following warnings:
> p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])


2. So, my second question is: is this warning because I am violating
the assumptions of the chisq test? But why is word 3 fine? How can I
trace the warning to see which word caused it?
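
A minimal sketch of one way to approach question 2, not taken from the
thread. chisq.test() issues this warning when any expected count is
below 5, so the check concerns expected rather than observed counts;
that is why word 3, whose smallest observed count is 5, passes. The
helper name warns() and the second slice of the toy array are made up
for illustration; the first slice is word 3 from above.

```r
# Word 3 from the post, filled column by column as tab[,,3] prints it.
word3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)

# Expected counts under independence; the smallest one decides the warning.
expected <- suppressWarnings(chisq.test(word3))$expected
min(expected)   # above 5 for word 3, hence no warning

# Hypothetical helper: TRUE if chisq.test() raises any warning for table m.
warns <- function(m) {
  tryCatch({
    chisq.test(m)
    FALSE
  }, warning = function(w) TRUE)
}

# Toy 2 x 2 x 2 array standing in for the real `tab`;
# slice 2 is invented and has very small expected counts.
tab <- array(c(427, 5, 2162365, 31854,
               3,   1, 2162789, 31858), dim = c(2, 2, 2))

# Indices of the words whose table triggers the warning.
flagged <- which(vapply(seq_len(dim(tab)[3]),
                        function(i) warns(tab[, , i]),
                        logical(1)))
flagged
```

Applied to the full array, `flagged` would hold the word codes to look
at more closely, for example with fisher.test().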

3. My result looks like this (after mapping from number id to word;
some words are stemmed here, e.g. ACCID is accident):
 > of[1:50,]
      map...2.       p.fisher
21       ACCID       0.00e+00
30          CD       0.00e+00
67        ROCK       0.00e+00
104      CRACK       0.00e+00
111       CHIP       0.00e+00
179      GLASS       0.00e+00
84        BACK  4.199878e-291
395   DRIVEABL  5.335989e-287
60         CAP  9.405235e-285
262 WINDSHIELD  2.691641e-254
13          IV  3.905186e-245
110         HZ  2.819713e-210
11        CAMP  9.086768e-207
2      SHATTER  5.273994e-202
297        ALP  1.678521e-177
162        BED  1.822031e-173
249        BCD  1.398391e-160
493       RACK  4.178617e-156
59        CAUS  7.539031e-147

3.1 Question: Should I use a two-sided test instead of a one-sided one
for the Fisher test? I read some material which suggests using
two-sided.

3.2 A big question: the result looks very promising, since this is a
case of classifying fraud cases and the words selected by this approach
make sense. However, I think the p-values here just indicate the
strength of evidence against the null hypothesis, not the strength of
association between word and class of document. So, what kind of
statistic should I use here to evaluate the strength of association?
Odds ratio?
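
As an illustration of the odds-ratio idea in 3.2 (a sketch only, not a
recommendation from the thread): the sample odds ratio of a 2 x 2 table
is the cross-product ratio, and a Wald interval on the log scale gives
a rough idea of its precision. The variable names are made up; the
counts are word 1 from above. The formula assumes no cell is zero. For
2 x 2 tables fisher.test() also reports a conditional estimate of the
odds ratio in its `estimate` component.

```r
# Word 1 from the post, filled column by column as tab[,,1] prints it.
word1 <- matrix(c(11266, 125, 2151526, 31734), nrow = 2)

# Sample odds ratio: (a * d) / (b * c); roughly 1.33 for this table.
or_sample <- (word1[1, 1] * word1[2, 2]) / (word1[1, 2] * word1[2, 1])

# Wald 95% interval on the log scale (assumes all four cells are > 0).
se_log <- sqrt(sum(1 / word1))
ci <- exp(log(or_sample) + c(-1, 1) * qnorm(0.975) * se_log)

or_sample
ci
```

Unlike the p-value, the odds ratio does not shrink towards zero simply
because the table contains two million documents, so it can be used to
rank words by effect size.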

Any suggestions are welcome!

Thanks!
-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] chisq test and fisher exact test

2005-06-22 Thread Kjetil Brinchmann Halvorsen
You can use chisq.test with sim=TRUE (short for simulate.p.value=TRUE,
which R accepts through partial argument matching), or call it as usual
first, see if there is a warning, and then call it again with sim=TRUE.
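
The second suggestion could be sketched as follows. This is one
possible reading of it, not Kjetil's own code; the helper name
p_value() and the table `small` are made up for illustration.

```r
# Hypothetical helper: use the asymptotic p-value when chisq.test() runs
# without a warning, otherwise fall back to a Monte Carlo p-value.
p_value <- function(m, B = 2000) {
  tryCatch(
    chisq.test(m)$p.value,
    warning = function(w) {
      chisq.test(m, simulate.p.value = TRUE, B = B)$p.value
    }
  )
}

# Word 3 from the original post: no warning, so the usual p-value is returned.
word3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)
p_word3 <- p_value(word3)

# Invented table with expected counts below 5: triggers the fallback.
set.seed(1)   # the simulated p-value depends on the random seed
small <- matrix(c(3, 1, 20, 30), nrow = 2)
p_small <- p_value(small)
```

Note that the simulated p-value cannot be smaller than 1 / (B + 1), so
with the default B = 2000 very strong associations all end up with the
same p-value.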

Kjetil

-- 

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
   --  Mahdi Elmandjra






Re: [R] chisq test and fisher exact test

2005-06-22 Thread Weiwei Shi
Is it because my question is too long that no one answers it? I should
have split it. :(



-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
