Re: [R] Essay identification

2005-06-13 Thread Greg Snow
This topic is sometimes called wordprinting or stylometry.  The spring
2003 issue of Chance magazine had several articles on the topic.

A colleague of mine and I have been working on a perl program (along
with various graduate students) to extract many of the common statistics
used in wordprinting (counts/percentages of non-contextual words, word
pattern ratios, vocabulary richness).  The data can then be loaded into
R (or any other stats package) to be analyzed.

The program is currently in beta (usable, but we would still like to
add features and documentation). I can send a copy to anyone who is
interested; please say whether you have Perl installed or need the
standalone version (Windows only).

hope this helps,

Greg Snow, Ph.D.
Statistical Data Center, LDS Hospital
Intermountain Health Care
[EMAIL PROTECTED]
(801) 408-8111

>>> Werner Bier <[EMAIL PROTECTED]> 06/12/05 01:29PM >>>
Hi R-help,
 
I have a database of 10 students who have written a total of 78
essays.
The challenge? I would like to identify who wrote the 79th essay.
 
Has anybody used R in this context?
 
Even if not, could you suggest which pattern recognition technique I
might apply?
 
Thanks a lot and regards,
Tom 




__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help 
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html



Re: [R] Essay identification

2005-06-13 Thread Werner Bier
Thank you so much for all your answers.
Papers, code, examples, methods... THANKS A LOT! :-)

P.S. Thanks to Richard R, Berton, Gabor, Roger P, Ted H et al. :-)



Re: [R] Essay identification

2005-06-12 Thread Ted Harding

On 12-Jun-05 Berton Gunter wrote:
> I assume that you know the usual procedure is to 'score'
> each essay by a vector that gives the frequency of occurrence
> of commonly used (sometimes adding subject matter specific)
> words and phrases. This multivariate response is then fed in
> as a "training set" into your favorite supervised
> learning/classification procedure. R has many of these -- trees,
> logistic regression, boosting, Random Forests, SVMs, LDA, SOMs
> (whoops -- that's an unsupervised one), ... . Try
> RSiteSearch('Classification', restrict = 'functions').
> 
> The devil is in the details as to what works best, I believe.
> With only 78 exemplars in 10 groups, unless there is a lot of
> separation (disparate styles that you could probably detect
> manually) it may be difficult. It also depends on how large
> each group is (balance is generally better).
> 
> Cheers,
> Bert

I would add to Berton's list such scores as numbers of different
words used, sentence lengths, relative frequencies of verbs,
nouns, adjectives, adverbs, and so on, perhaps scaled by overall
length. Length of Essay might even be a discriminant!
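
A minimal sketch of a few of these surface scores in base R. The
helper name lexical_features, the crude sentence-splitting rule and
the example text are illustrative assumptions, not part of any
package; part-of-speech frequencies would need a tagger and are not
attempted here.

```r
# Hypothetical helper: a few surface features for one essay.
lexical_features <- function(text) {
  # Crude sentence split: break at runs of ., ! or ? (plus following spaces).
  sentences <- unlist(strsplit(text, "[.!?]+\\s*"))
  # Words are taken to be runs of letters/apostrophes, case-folded.
  get_words <- function(x) {
    lower <- tolower(x)
    regmatches(lower, gregexpr("[a-z']+", lower))
  }
  words <- unlist(get_words(text))
  words_per_sentence <- lengths(get_words(sentences))
  c(n_words              = length(words),          # essay length
    n_types              = length(unique(words)),  # number of different words
    type_token_ratio     = length(unique(words)) / length(words),
    mean_sentence_length = mean(words_per_sentence))
}

essay <- "The cat sat on the mat. The dog sat too! Did the cat mind?"
lexical_features(essay)
```

Dividing the number of different words by the essay length (the
type/token ratio) is one simple way of scaling by overall length.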

You could also look at more subtle characteristics such as
"Zipf bins"[*] -- the relative numbers of different
words which occur once only, twice, three times, ... (though
I'm not sure how you would score such a thing for classification
purposes).
[*] A term I've just invented, inspired by the original instance
of this by the linguist Zipf, which later gave rise to the
logarithmic distribution in the historic paper by Fisher,
Corbet & Williams on the numbers of species and numbers
of individuals in butterfly traps.
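
One possible way to turn such bins into a fixed-length score, sketched
in base R. The function name zipf_bins, the cut-off max_k and the
"more" catch-all bin are assumptions made for illustration; this is
only one of many ways the idea could be scored.

```r
# Hypothetical helper: proportion of distinct words that occur exactly
# 1, 2, ..., max_k times, plus one bin for everything above max_k.
zipf_bins <- function(text, max_k = 3) {
  lower  <- tolower(text)
  words  <- unlist(regmatches(lower, gregexpr("[a-z']+", lower)))
  counts <- as.integer(table(words))        # occurrences of each distinct word
  # tabulate() counts how many distinct words occur 1, 2, ..., max_k times;
  # counts larger than max_k are ignored by tabulate() and added separately.
  bins <- c(tabulate(counts, nbins = max_k), sum(counts > max_k))
  names(bins) <- c(paste0("k", seq_len(max_k)), "more")
  bins / length(counts)                     # scale by number of distinct words
}

zipf_bins("the cat sat on the mat the dog sat too did the cat mind")
```

Because every essay yields a vector of the same length, the result can
be appended to the other scores before classification.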

If you really want to go to town you can try things related to
grammatical complexity, e.g. numbers of subordinate clauses
per sentence, relative clauses, the "reach" of relative pronouns
(how far from the referring pronoun is the thing referred to)
and so on.

There's quite an extensive literature on this sort of thing,
though it's not as fashionable as it used to be.

The real problem is that you can get carried away by "good
ideas" of things to try!

The other factor to bear in mind is that if the Essays
can be grouped by subject this is likely to influence many
of the scores (such as the above).

Hoping this helps and does not distract!
Ted.



E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-05   Time: 00:43:10
-- XFMail --



Re: [R] Essay identification

2005-06-12 Thread Gabor Grothendieck
On 6/12/05, Werner Bier <[EMAIL PROTECTED]> wrote:
> Hi R-help,
> 
> I have a database of 10 students who have written a total of 78 essays.
> The challenge? I would like to identify who wrote the 79th essay.
> 
> Has anybody used R in this context?
> 
> Even if not, could you suggest which pattern recognition technique I
> might apply?

Check out

http://xxx.uni-augsburg.de/PS_cache/cond-mat/pdf/0108/0108530.pdf

for a simple method.



Re: [R] Essay identification

2005-06-12 Thread Berton Gunter
I assume that you know the usual procedure is to 'score' each essay by a
vector that gives the frequency of occurrence of commonly used (sometimes
adding subject matter specific) words and phrases. This multivariate
response is then fed in as a "training set" into your favorite supervised
learning/classification procedure. R has many of these -- trees, logistic
regression, boosting, Random Forests, SVMs, LDA, SOMs (whoops -- that's an
unsupervised one), ... . Try
RSiteSearch('Classification', restrict = 'functions').
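
A rough sketch of the scoring-then-classifying procedure in base R.
The word list, the toy essays and the helper names (score_essay,
predict_author) are all made up for illustration, and a plain
nearest-centroid rule stands in for the classifiers named above; any
of them could be substituted once the score matrix X exists.

```r
# Assumed, illustrative list of commonly used (non-contextual) words.
function_words <- c("the", "and", "of", "to", "but", "that")

# Score one essay: relative frequency of each listed word.
score_essay <- function(text, vocab = function_words) {
  lower  <- tolower(text)
  words  <- unlist(regmatches(lower, gregexpr("[a-z']+", lower)))
  counts <- table(factor(words, levels = vocab))  # other words are dropped
  as.numeric(counts) / length(words)
}

# Toy training set (invented): two authors, two essays each.
train_text <- c("the cat and the dog and the bird",
                "the sun and the moon and the stars",
                "to run but to fall but to rise",
                "to sing but to cry but to laugh")
train_author <- c("A", "A", "B", "B")

# One row per essay, one column per listed word.
X <- t(sapply(train_text, score_essay, USE.NAMES = FALSE))
colnames(X) <- function_words

# Nearest-centroid rule: average the score vectors per author, then
# assign a new essay to the author with the closest centroid.
centroids <- rowsum(X, train_author) / as.vector(table(train_author))

predict_author <- function(text) {
  x <- score_essay(text)
  d <- apply(centroids, 1, function(m) sqrt(sum((x - m)^2)))
  names(which.min(d))
}

predict_author("the fox and the hen and the egg")
```

With real essays the word list would be much longer, and the same X
and train_author could be passed to whichever classifier is preferred.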

The devil is in the details as to what works best, I believe. With only 78
exemplars in 10 groups, unless there is a lot of separation (disparate
styles that you could probably detect manually) it may be difficult. It also
depends on how large each group is (balance is generally better).
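
With so few essays, one way to gauge whether the styles separate at
all is to check the group sizes and run a leave-one-out check. The
sketch below uses base R, an invented score matrix and a simple
nearest-centroid rule as a stand-in classifier; loo_accuracy is a
hypothetical helper, not a function from any package.

```r
# Leave-one-out accuracy of a nearest-centroid rule.
# X: numeric matrix, one row per essay; y: author labels.
loo_accuracy <- function(X, y) {
  y <- as.character(y)
  hits <- vapply(seq_len(nrow(X)), function(i) {
    Xtr <- X[-i, , drop = FALSE]
    ytr <- y[-i]
    # Per-author centroids computed without essay i. An author whose
    # only essay is held out has no centroid, so that essay is a miss.
    cents <- rowsum(Xtr, ytr) / as.vector(table(ytr))
    xi <- matrix(X[i, ], nrow(cents), ncol(X), byrow = TRUE)
    d  <- rowSums((cents - xi)^2)
    names(which.min(d)) == y[i]
  }, logical(1))
  mean(hits)
}

# Invented scores: 6 essays, 3 authors, 2 features per essay.
X <- rbind(c(0.10, 0.02), c(0.11, 0.03),
           c(0.02, 0.10), c(0.03, 0.11),
           c(0.20, 0.20), c(0.21, 0.19))
y <- c("a", "a", "b", "b", "c", "c")

table(y)            # group sizes, to see how balanced the authors are
loo_accuracy(X, y)  # share of held-out essays assigned to the right author
```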

Cheers,
Bert



[R] Essay identification

2005-06-12 Thread Werner Bier
Hi R-help,
 
I have a database of 10 students who have written a total of 78 essays.
The challenge? I would like to identify who wrote the 79th essay.
 
Has anybody used R in this context?
 
Even if not, could you suggest which pattern recognition technique I
might apply?
 
Thanks a lot and regards,
Tom 


