Re: [R] Essay identification
This topic is sometimes called wordprinting or stylometry. The spring 2003 issue of Chance magazine had several articles on the topic.

A colleague of mine and I (along with various graduate students) have been working on a Perl program to extract many of the common statistics used in wordprinting: counts and percentages of non-contextual words, word pattern ratios, and vocabulary richness. The data can then be loaded into R (or any other stats package) to be analyzed.

The program is currently in a beta state (usable, but we may add more features and documentation). I can send a copy to anyone who is interested; please specify whether you have Perl or need a standalone copy (Windows only).

Hope this helps,

Greg Snow, Ph.D.
Statistical Data Center, LDS Hospital
Intermountain Health Care
[EMAIL PROTECTED]
(801) 408-8111

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
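The statistics mentioned in the message above (percentages of non-contextual words, vocabulary richness) can also be sketched directly in R. What follows is a minimal illustration only, not the Perl program described above: the helper names `tokenize` and `essay_features`, the short word list, the two sample texts, and the use of a type-token ratio as the richness measure are all assumptions made for the example.

```r
# Hypothetical helper: lower-case the text and split it into words.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Illustrative list of non-contextual (function) words; a real study
# would use a much longer list.
function_words <- c("the", "of", "and", "to", "a", "in", "that", "it")

# One row of features per essay: length, a crude vocabulary-richness
# measure (type-token ratio), and the percentage of each function word.
essay_features <- function(text, vocab = function_words) {
  words <- tokenize(text)
  n <- length(words)
  counts <- table(factor(words, levels = vocab))
  c(
    n_words = n,
    type_token_ratio = length(unique(words)) / n,
    setNames(100 * as.numeric(counts) / n, paste0("pct_", vocab))
  )
}

# Two made-up sample texts standing in for real essays.
essays <- c(
  e1 = "The cat sat on the mat, and the dog sat in the hall.",
  e2 = "It is true that a man in want of a wife is to be pitied."
)

features <- as.data.frame(t(sapply(essays, essay_features)))
print(round(features, 2))
```

The resulting data frame has one row per essay, which is the shape most R classification functions expect. The type-token ratio depends strongly on text length, so it is only comparable between essays of similar size.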
Re: [R] Essay identification
Thank you so much for all your answers. Papers, codes, examples, methods... THANKS A LOT! :-)

P.S. Thanks to Richard R, Berton, Gabor, Roger P, Ted H et al. :-)
Re: [R] Essay identification
On 12-Jun-05 Berton Gunter wrote:
> I assume that you know the usual procedure is to 'score'
> each essay by a vector that gives the frequency of occurrence
> of commonly used (sometimes adding subject matter specific)
> words and phrases. This multivariate response is then fed in
> as a "training set" into your favorite supervised
> learning/classification procedure. R has many of these -- trees,
> logistic regression, boosting, Random Forests, svm's, LDA, SOM's
> (whoops -- that's an Unsupervised one), ... . Try
> RSiteSearch('Classification', restrict = 'functions').
>
> The devil is in the details as to what works best, I believe.
> With only 78 exemplars in 10 groups, unless there is a lot of
> separation (disparate styles that you could probably detect
> manually) it may be difficult. It also depends on how large
> each group is (balance is generally better).
>
> Cheers,
> Bert

I would add to Berton's list such scores as numbers of different words used, sentence lengths, relative frequencies of verbs, nouns, adjectives, adverbs, and so on, perhaps scaled by overall length. Length of Essay might even be a discriminant!

You could also look at more subtle characteristics such as "Zipf bins"[*] -- the relative numbers of different words which occur once only, twice, three times, ... (though I'm not sure how you would score such a thing for classification purposes).

[*] A term I've just invented, inspired by the original instance of this by the linguist Zipf, later giving rise to the logarithmic distribution in the historic paper by Fisher, Corbet & Williams on the "Numbers of Species and Numbers of Individuals" in butterfly traps.

If you really want to go to town you can try things related to grammatical complexity, e.g. numbers of subordinate clauses per sentence, relative clauses, the "reach" of relative pronouns (how far from the referring pronoun is the thing referred to) and so on.

There's quite an extensive literature on this sort of thing, though it's not as fashionable as it used to be. The real problem is that you can get carried away by "good ideas" of things to try!

The other factor to bear in mind is that if the Essays can be grouped by subject, this is likely to influence many of the scores (such as the above).

Hoping this helps and does not distract!
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-05 Time: 00:43:10
-- XFMail --
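One possible way to turn the "Zipf bins" and sentence lengths from the message above into numeric scores is sketched below. This is an illustration under assumptions, not an established scoring: the function names, the choice to pool all words occurring four or more times into a last bin, the naive sentence splitting on `.`, `!` and `?`, and the sample text are all made up for the example.

```r
# Hypothetical helper: lower-case the text and split it into words.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Proportion of distinct words that occur exactly 1, 2, ... times.
# Words occurring max_bin times or more are pooled into the last bin
# (an arbitrary choice made here to get a fixed-length vector).
zipf_bins <- function(text, max_bin = 4) {
  word_freq <- table(tokenize(text))
  binned <- pmin(as.integer(word_freq), max_bin)
  spectrum <- table(factor(binned, levels = seq_len(max_bin)))
  out <- as.numeric(spectrum) / length(word_freq)
  names(out) <- c(paste0("f", seq_len(max_bin - 1)),
                  paste0("f", max_bin, "plus"))
  out
}

# Mean number of words per sentence, using a naive sentence split.
mean_sentence_length <- function(text) {
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[grepl("[A-Za-z]", sentences)]
  mean(vapply(sentences, function(s) length(tokenize(s)), integer(1)))
}

# Made-up sample text standing in for a real essay.
sample_text <- "The cat saw the dog. The dog saw the cat! Who ran?"
bins <- zipf_bins(sample_text)
print(round(bins, 3))
print(mean_sentence_length(sample_text))
```

Because the bins are proportions of the distinct words, they sum to 1 and can be appended to a word-frequency vector as extra features. They still vary with essay length, so scaling or comparing essays of similar length remains advisable.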
Re: [R] Essay identification
On 6/12/05, Werner Bier <[EMAIL PROTECTED]> wrote:
> Hi R-help,
>
> I have a database of 10 students who have written an overall of 78 essays.
> The challenge? I would like to identify who wrote the 79th essay.
>
> Has anybody used R in this context?
>
> Even if not, would you suggest me which pattern recognition technique I might
> possibly apply?

Check out
http://xxx.uni-augsburg.de/PS_cache/cond-mat/pdf/0108/0108530.pdf
for a simple method.
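The post does not say what the simple method is. Assuming the linked preprint describes a compression-based ("zipping") comparison of texts, which is an assumption and not something stated in this thread, the general idea can be sketched in base R with `memCompress()`: append the unknown essay to each author's known text and see which author's text lets the compressor encode it in the fewest extra bytes. The function names and the toy texts are hypothetical.

```r
# Size in bytes of the gzip-compressed text.
compressed_size <- function(text) {
  length(memCompress(charToRaw(text), type = "gzip"))
}

# Extra bytes needed for `unknown` once the compressor has already seen
# `reference`; smaller values suggest more shared patterns.
extra_bytes <- function(reference, unknown) {
  compressed_size(paste(reference, unknown)) - compressed_size(reference)
}

# Score the unknown text against each author's pooled text and pick
# the author with the smallest score.
attribute_by_compression <- function(corpora, unknown) {
  scores <- vapply(corpora, extra_bytes, numeric(1), unknown = unknown)
  list(scores = scores, best = names(scores)[which.min(scores)])
}

# Toy texts, far too short for a meaningful result; real essays needed.
corpora <- c(
  alice = paste("I went to the market and I bought some apples and some",
                "pears. I went to the river and I saw some ducks."),
  bob = paste("Stylometric analysis requires careful feature selection;",
              "moreover, accuracy depends heavily upon sample size.")
)
unknown <- "I went to the park and I saw some dogs and some birds."

result <- attribute_by_compression(corpora, unknown)
print(result$scores)
print(result$best)
```

gzip only looks back over a 32 KB window, so when an author's pooled text is much longer than that, only its tail influences the score. With ten students and fewer than eight essays each on average, this is worth checking before trusting the result.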
Re: [R] Essay identification
I assume that you know the usual procedure is to 'score' each essay by a vector that gives the frequency of occurrence of commonly used (sometimes adding subject matter specific) words and phrases. This multivariate response is then fed in as a "training set" into your favorite supervised learning/classification procedure. R has many of these -- trees, logistic regression, boosting, Random Forests, svm's, LDA, SOM's (whoops -- that's an Unsupervised one), ... . Try RSiteSearch('Classification', restrict = 'functions').

The devil is in the details as to what works best, I believe. With only 78 exemplars in 10 groups, unless there is a lot of separation (disparate styles that you could probably detect manually) it may be difficult. It also depends on how large each group is (balance is generally better).

Cheers,
Bert
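The procedure described above (score each essay as a frequency vector, then train a classifier) can be sketched with LDA, one of the methods listed, using `lda()` from the MASS package. The data below are simulated purely for illustration: the three authors, the three function words, and their rates are made up, and the groups are deliberately well separated, which real essays may not be.

```r
library(MASS)  # recommended package shipped with R; provides lda()

set.seed(1)

# Made-up mean rates (per 1000 words) of three function words for three
# hypothetical authors. Real features would come from the essays.
true_means <- rbind(A = c(60, 30, 25),
                    B = c(50, 38, 30),
                    C = c(55, 25, 35))
colnames(true_means) <- c("the", "of", "and")

# Simulate 8 training essays per author around those means.
n_per_author <- 8
train <- do.call(rbind, lapply(rownames(true_means), function(a) {
  rates <- sapply(true_means[a, ], function(m) rnorm(n_per_author, m, sd = 1))
  data.frame(author = a, rates)
}))
train$author <- factor(train$author)

# Fit the classifier on the training set.
fit <- lda(author ~ ., data = train)

# The essay of unknown authorship, scored the same way (made-up values).
new_essay <- data.frame(the = 50.5, of = 37.5, and = 29.8)
prediction <- predict(fit, newdata = new_essay)
print(prediction$class)
print(round(prediction$posterior, 3))

# Leave-one-out cross-validation gives a rough idea of how well the
# authors can be told apart with this few essays.
cv <- lda(author ~ ., data = train, CV = TRUE)
loo_accuracy <- mean(cv$class == train$author)
print(loo_accuracy)
```

With 78 essays spread over 10 authors, the leave-one-out accuracy is probably a more useful number than the single prediction, since it shows whether the chosen features separate the authors at all.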
[R] Essay identification
Hi R-help,

I have a database of 10 students who have written a total of 78 essays. The challenge? I would like to identify who wrote the 79th essay.

Has anybody used R in this context?

Even if not, could you suggest which pattern recognition technique I might apply?

Thanks a lot and regards,
Tom