Re: [R] text vector clustering

David Winsemius Thu, 22 Jan 2009 07:12:06 -0800

Simply doing a tabulation and isolating the cases with only one entry might have been a possibility if the count discrepancy weren't so high. It appears you have a greater degree of corruption than would be expected just from "typos".
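
For reference, a minimal sketch of that tabulation idea. The sample vector and the variable names are made up for illustration; the real names would come from the one-column CSV file.

```r
# Hypothetical sample data; in practice this would be read from the CSV,
# e.g. with read.csv() on the one-column file.
student_names <- c("Anita Rao", "Anita Rao", "Anita Rao", "Anita Roa",
                   "Vijay Kumar", "Vijay Kumar", "Vijay Kumr")

# Normalise case and strip leading/trailing whitespace before counting
cleaned <- tolower(gsub("^[[:space:]]+|[[:space:]]+$", "", student_names))

# Frequency of each distinct spelling, most common first
counts <- sort(table(cleaned), decreasing = TRUE)

# Spellings that occur exactly once are the candidate typos
singletons <- names(counts)[counts == 1]

# Compare the number of distinct spellings with the expected number of
# real names (< 1000 in the original question)
n_distinct <- length(counts)
```

If `n_distinct` is only slightly above the expected number of real names, the singletons can be inspected by hand. With a large discrepancy, as here, some form of approximate matching is probably needed instead.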


Have you looked at the packages referenced at:

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

The Soundex algorithm is an old programming chestnut which I have seen implemented in R, but I understand there are improved versions. How well they perform on persons' names may depend strongly on cultural origins of your population.
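
To make the idea concrete, here is a rough sketch of classic American Soundex in base R. It is not any particular package's implementation; the function name `soundex` and the sample names are assumptions for illustration. It handles only unaccented a-z letters and treats each input string as a single name token.

```r
# Sketch of classic American Soundex (hypothetical helper, base R only).
soundex <- function(x) {
  codes <- c(b = "1", f = "1", p = "1", v = "1",
             c = "2", g = "2", j = "2", k = "2",
             q = "2", s = "2", x = "2", z = "2",
             d = "3", t = "3", l = "4",
             m = "5", n = "5", r = "6")
  one <- function(word) {
    letters_only <- gsub("[^a-z]", "", tolower(word))
    if (nchar(letters_only) == 0) return(NA_character_)
    chars <- strsplit(letters_only, "")[[1]]
    # After the first letter, h and w are dropped so that they do not
    # separate two consonants with the same code
    chars <- chars[c(TRUE, !(chars[-1] %in% c("h", "w")))]
    digits <- unname(codes[chars])
    digits[is.na(digits)] <- "0"      # vowels (and y) act as separators
    runs <- rle(digits)$values        # collapse adjacent equal codes
    rest <- runs[-1]                  # the first run belongs to the first letter
    rest <- rest[rest != "0"]         # drop the vowel placeholders
    # First letter plus up to three digits, zero-padded to four characters
    code <- paste(c(toupper(chars[1]), rest, "0", "0", "0"), collapse = "")
    substr(code, 1, 4)
  }
  unname(sapply(x, one))
}

# Group similar-sounding names by their code
surnames <- c("Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft")
groups <- split(surnames, soundex(surnames))
```

Names sharing a code land in the same group, so each group can then be reviewed or merged. Since Soundex was designed around English surnames, the groups may be too coarse or too fine for names from other languages.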


--
David Winsemius

On Jan 22, 2009, at 6:03 AM, srinivasa raghavan wrote:

Hi,
I am a new user of R, using R 2.8.1 on Windows 2003. I have a csv file with
a single column which contains 30,000 student names. There were typing
errors when these student names were entered. The actual list of names is <
1000. However, we don't have that list for a keyword search.

I am interested in grouping/clustering these names into those which are
similar letter for letter. Are there any text clustering algorithms in R
which can group names of similar type into segments of exactly matching,
90% matching, 80% matching, etc.?

thanks in advance,

regards,
srinivas
statistical analyst.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

