Weiwei Shi wrote:

>Hi,
>I have a text mining project and currently I am working on feature
>generation/selection part.
>My plan is selecting a set of words or word combinations which have
>better discriminant capability than other words in telling the group
>id's (2 classes in this case) for a dataset which has 2,000,000
>documents.
>
>One approach is using "contrast-set association rule mining" while the
>other is using chisqr or fisher exact test.
>
>An example which has 3 contingency tables for 3 words as followed
>(word coded by number):
>
>> tab[,,1:3]
>
>, , 1
>
>      [,1]    [,2]
>[1,] 11266 2151526
>[2,]   125   31734
>
>, , 2
>
>      [,1]    [,2]
>[1,] 43571 2119221
>[2,]    52   31807
>
>, , 3
>
>     [,1]    [,2]
>[1,]  427 2162365
>[2,]    5   31854
>
>
>I have some questions on this:
>1. What's the thumb of rule to use chisq test instead of Fisher exact
>test. I have a vague memory which said for each cell, the count needs
>to be over 50 if chisq instead of fisher exact test is going to be
>used. In the case of word 3, I think I should use fisher test.
>However, running chisq like below is fine:
>
>> tab[,,3]
>
>     [,1]    [,2]
>[1,]  427 2162365
>[2,]    5   31854
>
>> chisq.test(tab[,,3])
>
>        Pearson's Chi-squared test with Yates' continuity correction
>
>data:  tab[, , 3]
>X-squared = 0.0963, df = 1, p-value = 0.7564
>
>but running on the whole set of words (including 14240 words) has the
>following warnings:
>
>> p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
>
>There were 50 or more warnings (use warnings() to see the first 50)
>
>> warnings()
>
>Warning messages:
>1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>
>
>2. So, my second question is, is this warning b/c I am against the
>assumption of using chisq. But why Word 3 is fine? How to trace the
>warning to see which word caused this warning?
>
>3. My result looks like this (after some mapping treating from number
>id to word and some words are stemmed here, like ACCID is accident):
>
>> of[1:50,]
>      map...2.      p.fisher
>21       ACCID  0.000000e+00
>30          CD  0.000000e+00
>67        ROCK  0.000000e+00
>104      CRACK  0.000000e+00
>111       CHIP  0.000000e+00
>179      GLASS  0.000000e+00
>84        BACK 4.199878e-291
>395   DRIVEABL 5.335989e-287
>60         CAP 9.405235e-285
>262 WINDSHIELD 2.691641e-254
>13          IV 3.905186e-245
>110         HZ 2.819713e-210
>11        CAMP 9.086768e-207
>2      SHATTER 5.273994e-202
>297        ALP 1.678521e-177
>162        BED 1.822031e-173
>249        BCD 1.398391e-160
>493       RACK 4.178617e-156
>59        CAUS 7.539031e-147
>
>3.1 question: Should I use two-sided test instead of one-sided for
>fisher test? I read some material which suggests using two-sided.
>
>3.2 A big question: Even though the result looks very promising since
>this is case of classiying fraud cases and the words selected by this
>approach make sense. However, I think p-values here just indicate the
>strength to reject null hypothesis, not the strength of association
>between word and class of document. So, what kind of statistics I
>should use here to evaluate the strength of association? odds ratio?
>
>Any suggestions are welcome!
>
>Thanks!
>

You can use chisq.test with simulate.p.value=TRUE (sim=TRUE also works,
through partial argument matching), or call it as usual first, see if
there is a warning, and then call it again with simulate.p.value=TRUE.
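
The second suggestion could be sketched roughly as below. This is not code from the thread: chisq_p is a hypothetical helper name, B = 2000 is simply chisq.test's default number of replicates, and the array tab is rebuilt here from the three tables quoted above. As far as I know, chisq.test raises the "approximation may be incorrect" warning when some expected cell count is below 5; for word 3 the smallest expected count is about 6.3, which would explain why that table runs without a warning.

```r
# The three 2x2 tables from the quoted post, stored column-major as tab[,,i]
tab <- array(c(11266, 125, 2151526, 31734,
               43571,  52, 2119221, 31807,
                 427,   5, 2162365, 31854),
             dim = c(2, 2, 3))

# Hypothetical helper: run chisq.test() as usual; if it warns, rerun it
# with a Monte Carlo p-value and record that the fallback was used.
chisq_p <- function(m, B = 2000) {
  warned <- FALSE
  p <- withCallingHandlers(
    chisq.test(m)$p.value,
    warning = function(w) {
      warned <<- TRUE                  # remember that a warning occurred
      invokeRestart("muffleWarning")   # and keep it out of the console
    }
  )
  if (warned) {
    p <- chisq.test(m, simulate.p.value = TRUE, B = B)$p.value
  }
  list(p.value = p, simulated = warned)
}

# Apply to every word; 'flagged' holds the indices whose first call warned
res     <- lapply(seq_len(dim(tab)[3]), function(i) chisq_p(tab[, , i]))
p.chisq <- vapply(res, function(r) r$p.value, numeric(1))
flagged <- which(vapply(res, function(r) r$simulated, logical(1)))
```

The flagged vector also gives one way to trace which words triggered the warning. Note that a simulated p-value cannot be smaller than 1/(B + 1), so it is coarse for ranking very strongly associated words.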

Kjetil

-- 
Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
      -- Mahdi Elmandjra