Giving this a little more thought, it seems to me to be a pretty tricky application of 
statistical analysis.  The problem is that the underlying process, a person's 
writing, may itself be pretty variable.  In my own case I am aware of my own tendency 
to write in run-on sentences (not to mention that if I do it, it is a run-on sentence; 
if William Faulkner does it, it is a brilliant use of stream of consciousness 
technique).  If I am feeling lazy or hurried this will go uncorrected; however, if I 
have a little time I will clean it up.  Therefore an indicator such as the ratio of 
compound sentences to simple sentences, or the average sentence length in words, may 
vary considerably between different samples of my writing.  Not to mention a 
difference in style if I am writing to simply explain something versus advocating or 
trying to convince someone else of something, or flaming someone.  Systematic 
differences in the way I write now as opposed to 10 years ago could also be a problem.

The obvious way to attack the problem is with a lot of samples and indicators 
which are then compared against a control to determine which are the most distinctive 
characteristics of each writer.  Still, you probably need to come up with a standard 
deviation for each indicator, and the answer is going to be a probability rather than a 
certainty, but that's life.  You should probably also ask yourself which is more 
important, Type I or Type II errors, or whether you need a good balance between them.  For 
instance I can identify 100% of the posts made by Tim simply by identifying every post 
as being from him.  I have now identified 100% of his posts but it may not be very 
useful.  On the other hand, if I am concerned about not attributing a post by one nym to 
another nym incorrectly, I simply don't attribute any posts to the first nym.  Thus I 
could say the indicator is 100% accurate but it simply isn't very useful.  So I think 
you need to work out a good control to judge samples against and possibly some scheme 
to "stratify" samples, for instance not using my e-mail 
postings with business writing, as the two will have different characteristics, but rather 
using samples of my postings to identify my many anonymous postings and my business 
writing to identify my many anonymous business writings.  (If I were feeling more 
energetic on Monday morning I would break that up into a couple of shorter sentences.) 
But you need to be very careful not to introduce too much subjectivity into the 
categorization.
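
To make the indicator-plus-standard-deviation idea a bit more concrete, here is a rough Python sketch.  Everything in it is an assumption for illustration: the two indicators (words per sentence, characters per word) are just crude stand-ins, the sentence splitter is naive, and the function names indicators, profile and z_scores are made up here rather than taken from any library.  It builds a mean and standard deviation for each indicator from known samples of one writer, then reports how many standard deviations an unknown sample sits from that mean.

```python
import re
import statistics


def indicators(text):
    """Return a few crude style indicators for one writing sample."""
    # Naive split: a run of . ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent = max(len(sentences), 1)   # avoid dividing by zero on empty text
    n_words = max(len(words), 1)
    return {
        "words_per_sentence": len(words) / n_sent,
        "chars_per_word": sum(len(w) for w in words) / n_words,
    }


def profile(samples):
    """Mean and sample standard deviation of each indicator.

    Needs at least two known samples, otherwise the standard
    deviation is undefined and statistics.stdev raises an error.
    """
    rows = [indicators(s) for s in samples]
    result = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        result[name] = (statistics.mean(values), statistics.stdev(values))
    return result


def z_scores(sample, prof):
    """Distance of a sample from the writer's mean, in standard deviations."""
    ind = indicators(sample)
    scores = {}
    for name, (mean, sd) in prof.items():
        # An indicator that never varied tells us nothing; report 0.0.
        scores[name] = (ind[name] - mean) / sd if sd > 0 else 0.0
    return scores
```

Something like z_scores(unknown_post, profile(my_known_posts)) would then give one number per indicator.  A large absolute value only suggests the sample is unlike the known writing; as noted above, the answer is a probability rather than a certainty, and with my own variability the standard deviations may turn out too wide to be much use.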
 

Jim

--

On Sun, 13 May 2001 23:03:56   Ryan Sorensen wrote:
>
>* Tim May <[EMAIL PROTECTED]> [010513]:
>> At 9:41 PM -0700 5/13/01, Ryan Sorensen wrote:
>> >So I get this idea.
>> >Crypto is great for lots of things, but anonymous public postings it's not.
>> >I know this has been discussed here before, but I haven't seen specifics.
>> >
>> >
>> >What exactly makes a person's writing style distinctive?
>> >
>> >Is it distinctive phrases?
>> >Number of syllables?
>> >
>> >And almost the inverse, how would you come up with a "generic" writing style?
>> >
>> >Any help is appreciated.
>> >Including pointers to online resources or past discussions, if they 
>> >have any specifics.
>> 
>> Think in terms of how _you_ would try to identify similar styles.
>> 
>> -- British or foreign usages
>> 
>> -- type of emphasis indicators (like _this_ or like *this* or like....)
>> 
>> -- use of ellipses, em dashes, etc.
>> 
>> -- vocabulary, phrases
>> 
>I didn't want to have to come up with my own list if someone had done work
>before me.
>
>Other ones I was thinking of were the number of syllables used in words,
>length of paragraphs, number of times sentences are "split" and go on their
>way towards run on sentences. These would be in addition to the ones listed
>by Jim Windle. (Thanks Jim!)
>
><snipped>
>> Will frequent posters to this and other mailing lists have specific 
>> posts fall into correlation "bins"?  You tell us.
>On this list? It's hard to say. I haven't been actively paying attention to
>the way people write here for long enough.
>There are other lists, in particular dc-stuff, where people give long
>passages with enough idiosyncrasies to give me the gut feeling that they
>could be categorized at least.
>I will of course report more as I begin to actually run some statistical
>tests against the posts.
>
>> 
>> ** Tim May
>> -- 
>> Timothy C. May         [EMAIL PROTECTED]        Corralitos, California
>> Political: Co-founder Cypherpunks/crypto anarchy/Cyphernomicon
>> Technical: physics/soft errors/Smalltalk/Squeak/agents/games/Go
>> Personal: b.1951/UCSB/Intel '74-'86/retired/investor/motorcycles/guns
>> 
>
>Is it common on mailing lists to address people by first name?
>I know I see this sort of behavior on mailing lists between regulars, but
>I'm not quite sure of how it works with people new to posting on said lists.
>And Tim, this was the question I was asking you earlier. I notice now it may
>have been misconstrued as a poor jab regarding the recent "Timmy" thing.
>
>--Ryan Sorensen
>
>

