I think the short answer is that I'm not sure. I'm not entirely clear
what a match probability ratio is, so let me give a brief account
of what NSP provides, and you can see if that is sufficient to compute
what you are after.

In the case of

John Edwards
Jon Edwards

NSP (count.pl) would give you counts of how often "John" occurred,
how often "Jon" occurred, how often "Edwards" occurred, how often
"John Edwards" occurred, and how often "Jon Edwards" occurred. From
those counts, statistic.pl can compute measures of association that
tell you how strongly associated "John" and "Edwards" are, and how
strongly associated "Jon" and "Edwards" are. I don't think it will tell
you anything directly about the association between "John Edwards" and
"Jon Edwards".
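
To make the distinction concrete, here is a minimal sketch in Python (not
NSP itself, and not its output format) of the kind of counts and
association score described above. The name list is made up for
illustration, and pointwise mutual information is used only as one example
of an association measure; the function name pmi is hypothetical.

```python
import math
from collections import Counter

# Hypothetical sample data, one name per record.
names = ["John Edwards", "Jon Edwards", "John Edwards", "John Smith"]

unigrams = Counter()   # how often each word occurs
bigrams = Counter()    # how often each adjacent word pair occurs
for name in names:
    tokens = name.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def pmi(w1, w2):
    """Pointwise mutual information of w1 followed by w2 (log base 2)."""
    total = sum(bigrams.values())
    joint = bigrams[(w1, w2)]
    if joint == 0:
        return float("-inf")
    # Marginals: w1 in first position, w2 in second position.
    first = sum(c for (a, _), c in bigrams.items() if a == w1)
    second = sum(c for (_, b), c in bigrams.items() if b == w2)
    return math.log2((joint / total) / ((first / total) * (second / total)))

print(unigrams["John"], unigrams["Jon"], unigrams["Edwards"])
print(bigrams[("John", "Edwards")], bigrams[("Jon", "Edwards")])
print(pmi("John", "Edwards"), pmi("Jon", "Edwards"))
```

Note that both scores relate a first name to a surname within one name.
Neither says anything about how similar "John Edwards" is to "Jon Edwards".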

So, I'm not sure if this helps. Can you describe more exactly what
you wish to compute (especially if it involves any of the quantities
mentioned above)?
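
If what you want turns out to be a score for how alike two name strings
are, that is a different quantity from the counts above. Purely as an
assumption about what you might mean, here is a rough sketch of the loop
you describe, written in Python with the standard difflib module rather
than NSP. The sample names, the THRESHOLD value, and the similarity
function are all hypothetical, and the score is a string-similarity ratio,
not a probability.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data; in practice this would be read from a file.
names = ["John Edwards", "Jon Edwards", "Mary Smith"]
THRESHOLD = 0.9  # arbitrary cutoff chosen for illustration

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

matches = []
for a, b in combinations(names, 2):   # every unordered pair once
    score = similarity(a, b)
    if score >= THRESHOLD:
        matches.append((a, b, score))

for a, b, score in matches:
    print(f"{a} <-> {b}: {score:.3f}")
```

Comparing every pair grows quadratically with the length of the list, so
for a large file you would probably need to group candidates first (for
example by surname) before scoring them.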

Also, keep in mind there are some utility programs that come with
NSP that *might* be useful - these include rank.pl and kocos.pl. But,
more about those if they seem relevant.

Thanks,
Ted

On Fri, 16 Dec 2005, dave1234870 wrote:

> Greetings all, my first post to the group, hoping that it's an
> appropriate forum for this question ... if not, my apologies to the
> group.
>
> I'm a moderately proficient, self-taught Perl hacker working in the
> fraud examination type industry.  I work with large amounts of data to
> identify scenarios wherein Names and/or Addresses serve as nexus
> points for discrete network analysis.  Of course, my problem is that
> names and addresses are quite often misspelled or not consistent.
> Examples,
>
> John Edwards
> Jon Edwards
>
> 123 Main Street
> 123 Main St
>
> PO Box 123
> Post Office Box 123
> etc.
>
> I've read over the docs for the NSP package, but am having a hard time
> wrapping my brain around it.  Would it be possible for the NSP package
> (count.pl and statistic.pl) to accomplish a test upon a pair of names
> to achieve a match probability ratio?
>
> In a perfect world, I want to open a large file with 1 long list of
> names.  Starting at the first name, I want to iterate over the entire
> list and achieve ratio probabilities for each pair of names.  As each
> ratio is computed, I'll test it for a threshold and if the pair
> exceeds a threshold, I'll push it to an array.  Repeat for the 2nd
> name in the list, 3rd name in the list, etc.
>
> Thanks in advance for any wisdom you might have on this question :-)

--
Ted Pedersen
http://www.d.umn.edu/~tpederse