Re: [datameet] Phonetic Similarity

2018-08-15 Thread Pradeep Bhatt
Thanks!

I will have a look and update here.

-- 
Datameet is a community of Data Science enthusiasts in India. Know more about 
us by visiting http://datameet.org
--- 
You received this message because you are subscribed to the Google Groups 
"datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to datameet+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [datameet] Phonetic Similarity

2018-08-14 Thread Nikhil VJ
Hi Pradeep,

If you have all the words in one column of an Excel sheet, then the
*OpenRefine* tool can help you "iron out" the differences. It will show you
clusters of similar-looking cells, and you can decide which value to go with
(you can even type in a new standardised value if all the options are wrong).
It will then overwrite all those cells with the one standardised value.
The rest of your data remains intact. No need for sorting, filtering etc.

You can read a basic walkthrough for this specific use case here:
http://datameet.org/2018/06/13/openrefine-bus-stop/

It uses multiple algorithms to detect similar words, much as search
engines and dictionaries do when you make a typo. You can modify the
algorithm options and run new scans to catch the hard-to-find ones. If there
is a false positive, you can just ignore it and no changes will be made
to those values.
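
To make the clustering idea concrete, here is a minimal sketch of key-collision clustering in Python. It is not OpenRefine's code: `fingerprint` is a simplified approximation of the tool's fingerprint keying, and `clusters`, `merge` and the sample values are hypothetical names chosen for illustration. A key like this only catches differences in case, spacing, punctuation and word order; pairs such as Pradeep / Pradip would need one of the phonetic or nearest-neighbour options instead.

```python
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key (an approximation, not OpenRefine's code)."""
    # Trim, lowercase, and fold accented characters down to plain ASCII.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, keeping only letters, digits and whitespace.
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def clusters(values, key=fingerprint):
    """Group distinct values that share the same key."""
    groups = defaultdict(set)
    for value in values:
        groups[key(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def merge(values, cluster, chosen):
    """Overwrite every member of one cluster; leave all other cells alone."""
    members = set(cluster)
    return [chosen if value in members else value for value in values]


# Hypothetical column of cell values.
column = ["Shivaji Nagar", "shivaji nagar ", "Nagar, Shivaji", "Deccan"]
for cluster in clusters(column):
    column = merge(column, cluster, "Shivaji Nagar")
print(column)
```

Passing a different `key` function (for example a phonetic code) changes what counts as "similar" without touching the rest of the sketch.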


--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune, India
Self-designed learner at Swaraj University 



Re: [datameet] Phonetic Similarity

2018-08-13 Thread Venkata Pingali
Soundex is not enough. We went through Metaphone and
Double Metaphone as well. The last showed the best
performance when combined with simple ways to reduce
the search space (e.g., comparing only names that start
with the same letter).
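
As a rough illustration of reducing the search space, the sketch below compares names only within the same first-letter block. `crude_key` is a hypothetical stand-in for a real phonetic code such as Double Metaphone (which needs a third-party library), and `candidate_pairs` is likewise an illustrative name, not something from this thread.

```python
from collections import defaultdict
from itertools import combinations


def crude_key(name):
    """Hypothetical stand-in for a phonetic code.

    Keeps the first letter, collapses doubled letters, and drops
    vowels, 'y' and 'h' everywhere else.
    """
    letters = [c for c in name.lower() if c.isalpha()]
    out = []
    prev = ""
    for i, ch in enumerate(letters):
        if ch != prev and (i == 0 or ch not in "aeiouyh"):
            out.append(ch)
        prev = ch
    return "".join(out)


def candidate_pairs(names, key=crude_key):
    """Compare names only inside their first-letter block."""
    blocks = defaultdict(list)
    for name in sorted(set(names)):
        if name:
            blocks[name[0].lower()].append(name)
    pairs = []
    for block in blocks.values():
        # Pairwise comparison is quadratic, so small blocks matter.
        for a, b in combinations(block, 2):
            if key(a) == key(b):
                pairs.append((a, b))
    return pairs


names = ["Pradeep", "Pradip", "Thakkkar", "Thakkar",
         "Ramesh", "Rajesh", "Rathod", "Rathor"]
print(candidate_pairs(names))
```

Blocking on the first letter means a pair such as Kamal / Camal would never be compared, which is one way false negatives creep in.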

But it still had too many false positives and false negatives. We ended up
using a much simpler approach: manually labeling the top N most
frequent names.
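
The manual-labeling approach could look roughly like the sketch below. The function names and the contents of `MANUAL_LABELS` are hypothetical; in practice a person would review the most frequent names and fill in the mapping by hand.

```python
from collections import Counter


def top_names(names, n):
    """Return the n most frequent names after trimming and lowercasing."""
    counts = Counter(name.strip().lower() for name in names)
    return counts.most_common(n)


# Hypothetical hand-made mapping from a variant to its standard spelling,
# built by reviewing the output of top_names().
MANUAL_LABELS = {
    "pradip": "pradeep",
    "thakkkar": "thakkar",
}


def standardise(name, labels=MANUAL_LABELS):
    """Apply the manual label if one exists, else keep the cleaned name."""
    cleaned = name.strip().lower()
    return labels.get(cleaned, cleaned)


names = ["Pradeep", "pradeep ", "Pradip", "Ramesh"]
print(top_names(names, 2))
print([standardise(name) for name in names])
```

Names outside the labeled set pass through unchanged, so coverage depends on how much of the data the top N names account for.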





[datameet] Phonetic Similarity

2018-08-13 Thread Pradeep Bhatt
Hi All,

What is the best way to know if two words are phonetically similar?

*Some similar words*, e.g.:

Pradeep - Pradip
Thakkkar - Thakkar
Rathod - Rathor
Swetha - Sweta
bhen - ben
Sumandev - Sumandeb

*Non - Similar*
Ramesh - Rajesh

This is needed for spelling mistakes introduced when transliterating from
Indian languages to English.

Does Soundex work well for Indian names?

Regards,
Pradeep
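
For reference, the sketch below is a small pure-Python implementation of standard American Soundex, run on the pairs listed above. The function name `soundex` and the test harness are illustrative, not taken from any particular library. On these seven pairs it matches all but Rathod / Rathor, but seven hand-picked pairs say little about how it behaves on a large list of real names.

```python
# Letter-to-digit table for American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code, or "" if there are no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


pairs = [("Pradeep", "Pradip"), ("Thakkkar", "Thakkar"), ("Rathod", "Rathor"),
         ("Swetha", "Sweta"), ("bhen", "ben"), ("Sumandev", "Sumandeb"),
         ("Ramesh", "Rajesh")]
for a, b in pairs:
    print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Rathod and Rathor come out as R330 and R360 because Soundex puts d and r in different classes, while Ramesh and Rajesh are correctly kept apart (R520 vs R220).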
