statistical similarity of two text

2001-07-17 Thread Cantor

Hello,
Does anybody know where I can find a program on the web that compares two
texts/articles and decides whether or not they are similar at some given
significance level?

Thanks in advance
Cantor




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: statistical similarity of two text

2001-07-18 Thread Donald Burrill

On Tue, 17 Jul 2001, Cantor wrote:

> Does anybody know where I can find program on the website which [can] 
> compare two texts/articles and settle whether or not they are similar 
> assuming any significant level.

Sorry, Cantor:  this is not possible, in general.  
 One can discover whether two (or more) things _differ_ (on some 
quantitative measure) at a specified significance level (when this is a 
reasonable thing to do -- it isn't always reasonable), but the formal 
definition of "significant" in statistical analysis does not permit 
discovering whether two (or more) things are _similar_.  
 However, it may suffice for your purposes to discover that two things 
are not different enough that you can tell them apart (which is not the 
same thing as discovering that they are the same), on whatever measure 
(or set of measures) you choose to analyze.  Whether this be a useful 
outcome or not depends heavily on how much information you have (that 
is, on the size of the sample available) on the things being compared. 

In any case, the hard part is defining the characteristics, or  
properties, or measures, on which the two texts/articles are to be 
compared. 
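A minimal sketch of the point above, assuming purely for illustration that the chosen measure is word length and that the words of a text can be treated as exchangeable observations (real text does not strictly satisfy this). The function names, the sample texts, and the choice of a permutation test are assumptions made for this example, not something proposed in the thread. A small p-value suggests the two texts differ on this one measure; a large p-value only means they could not be told apart, not that they are the same.

```python
import random
import re


def word_lengths(text):
    """Return the length of every word in the text (a deliberately crude measure)."""
    return [len(w) for w in re.findall(r"[A-Za-z']+", text)]


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means between samples a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Reassign the pooled values to two groups of the original sizes.
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical sample texts, chosen only to make the contrast obvious.
text_a = "the cat sat on the mat and the dog ran to the red hut by the sea"
text_b = ("statistical significance characterizes quantitative differences "
          "between extraordinarily complicated distributions")
p = permutation_p_value(word_lengths(text_a), word_lengths(text_b))
print(f"p = {p:.4f}")
```

With more than one measure (sentence length, function-word rates, and so on), the same test could be repeated per measure, at the cost of a multiple-comparisons problem.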

 
 Donald F. Burrill [EMAIL PROTECTED]
 184 Nashua Road, Bedford, NH 03110  603-471-7128






Re: statistical similarity of two text

2001-07-19 Thread Rich Ulrich

On 18 Jul 2001 03:41:57 -0700, [EMAIL PROTECTED] (Donald Burrill)
wrote:

> On Tue, 17 Jul 2001, Cantor wrote:
> 
> > Does anybody know where I can find program on the website which [can] 
> > compare two texts/articles and settle whether or not they are similar 
> > assuming any significant level.
DB > 
> Sorry, Cantor:  this is not possible, in general.  
>  One can discover whether two (or more) things _differ_ (on some 
> quantitative measure) at a specified significance level (when this is a 
> reasonable thing to do -- it isn't always reasonable), but the formal 
> definition of "significant" in statistical analysis does not permit 
> discovering whether two (or more) things are _similar_.  
>  However, it may suffice for your purposes to discover that two things 
> are not different enough that you can tell them apart (which is not the 
> same thing as discovering that they are the same), on whatever measure 

Don, 
 - this puts me in mind of other Similarities.  We do not
speak of "significance" there, but there are other things to describe,
and "probability" comes into the descriptions.

Do we have a taxonomy of models?  You just mentioned two or
three questions:
(a) two things are different; 
(b) two things are 'not different enough to tell apart';
(c) two things are shown to be the same.

We have discussed "Bioequivalence" before - (b) and (c)
are not EXACTLY the same problem, I don't think.

Texts: Are two texts similar enough to be by the same author?
 - to be assuredly by the same author? ["Shakespeare?"  Or, how 
many authors were there of the books of the New Testament?]
 - to be plagiarism, or cheating on a test?
Are these the same question, except that one concerns content 
and the other concerns style?

Where similarity is accepted in court:
P-levels have not been a concern with fingerprints, but that is
because those experts have done a peculiar job, insisting on
HIGH confidence and refusing to testify to the "partial"
matches on degraded prints.

DNA matching: probabilities have been more explicit.
The lab work for blood-typing is similar, but the explicit
probabilities are sometimes mediocre instead of convincingly
extreme.

My apologies for rambling.  I didn't have a question to focus on,
but maybe this will remind someone of something more that ought 
to be said about 'similarities'.
-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: statistical similarity of two text

2001-07-19 Thread Dennis Roberts

At 04:21 PM 7/19/01 -0400, Rich Ulrich wrote:
>On 18 Jul 2001 03:41:57 -0700, [EMAIL PROTECTED] (Donald Burrill)
>wrote:
>
> > On Tue, 17 Jul 2001, Cantor wrote:
> >
> > > Does anybody know where I can find program on the website which [can]
> > > compare two texts/articles and settle whether or not they are similar
> > > assuming any significant level.
>DB >
> > Sorry, Cantor:  this is not possible, in general.


some try however ... see http://www.turnitin.com ... a plagiarism detection 
company


_
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:[EMAIL PROTECTED]
http://roberts.ed.psu.edu/users/droberts/drober~1.htm






Re: statistical similarity of two text

2001-07-22 Thread Yuval Feinstein



As Rich Ulrich and others have already said:
Do you want to check "surface" similarity?
You can use a bag-of-words representation
and a cosine distance between two texts, or
more sophisticated versions.
See:
An Information-Theoretic Definition of Similarity, by Dekang Lin:
http://www.cs.ualberta.ca/~lindek/papers.htm

Other people use clustering to find similarity:
http://lsi.research.telcordia.com/lsi/LSIpapers.html

I guess variants of these methods can be used to detect plagiarism.
If you want to go deeper, to determine whether two text documents
are about a similar subject, this is a much harder problem.
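
The bag-of-words and cosine idea mentioned above can be sketched roughly as follows. This is only an illustration under simple assumptions: words are lower-cased runs of letters, no stop-word removal or term weighting is applied, and the function names and sample sentences are made up for the example. The code returns the cosine similarity; the corresponding cosine distance is 1 minus that value.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count each word; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the lazy dog sleeps while the quick fox runs"
similarity = cosine_similarity(bag_of_words(doc1), bag_of_words(doc2))
print(f"cosine similarity = {similarity:.3f}")
print(f"cosine distance   = {1 - similarity:.3f}")
```

Note that this gives a descriptive score between 0 and 1, not a significance level; any threshold for calling two texts "similar" would have to be chosen or calibrated separately.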

--
Yuval Feinstein
email : [EMAIL PROTECTED]
--







Re: statistical similarity of two text

2001-07-22 Thread Jason Harrison

"Cantor" <[EMAIL PROTECTED]> writes:
>Does anybody know where I can find program on the website which compare two
>text/article and settle whether or not they are similar assuming any
>significant level.

You may want to try Latent Semantic Analysis -- this technique looks
at the meaning of the words used in the texts.  It has been used to
evaluate essays for the quality level of the writing.

lsa.colorado.edu -- the site is down now, but they did have a
"compare two texts" web page.

-Jason
-- 
J. [EMAIL PROTECTED]
Graduate Motto: Free-time with guilt.
http://www.cs.ubc.ca/~harrison
http://www.cs.ubc.ca/~harrison/dance





Re: statistical similarity of two text

2001-07-25 Thread Normand Peladeau

Similar on which aspects?  Syntax? Grammar? Vocabulary? Content? Style?

I wrote a content analysis module called WordStat that allows you to
find the differences between two or more texts (different authors, subgroups
of individuals such as male vs female or age subgroups, etc.).  You can
analyze differences in word usage or in categories of words (predefined or
user defined).  You can download a trial version of WordStat from the
following URL:

www.simstat.com/wordstat.htm

However, it won't necessarily provide you with a single p value and you will
have to identify the dimension you want to compare, and if needed, create a
categorization dictionary.
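
To make the idea of a categorization dictionary concrete, here is a rough sketch of comparing two texts on categories of words. It is not WordStat's actual implementation or file format; the category names, word lists, function name, and sample texts are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical categorization dictionary: category name -> set of words.
CATEGORIES = {
    "certainty": {"always", "never", "certainly", "must"},
    "hedging": {"maybe", "perhaps", "possibly", "might"},
}


def category_rates(text, categories):
    """Return each category's share of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for name, vocabulary in categories.items():
            if word in vocabulary:
                counts[name] += 1
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {name: counts[name] / total for name in categories}


# Hypothetical sample texts.
text_a = "You must always check. It never fails."
text_b = "Perhaps it works. It might fail, maybe."
for label, text in (("A", text_a), ("B", text_b)):
    print(label, category_rates(text, CATEGORIES))
```

The resulting rates are one dimension of comparison per category; whether a difference between them is statistically meaningful is a separate question that depends on the amount of text available.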

You can also find a list of other computer-assisted text analysis (CATA)
software at the following URL:

www.intext.de/TEXTANAE.HTM


Normand Peladeau
Provalis Research


"Cantor" <[EMAIL PROTECTED]> wrote in message
news:9j1ir6$jbv$[EMAIL PROTECTED]...
> Hello,
> Does anybody know where I can find program on the website which compare two
> text/article and settle whether or not they are similar assuming any
> significant level.
>
> Thanks in advance
> Cantor
>
>





