Re: string similarity measures

Karl Wettin Thu, 04 Sep 2008 07:03:28 -0700


On 4 Sep 2008, at 15:54, Cam Bazz wrote:

Yes, I already have a system for users reporting words. They show up on an
operator screen, and if the operator approves, or if 3 other people marked
it as a curse, then it is filtered.
In the other thread you wrote:
I would create 1-5 ngram sized shingles and measure the distance using
Tanimoto coefficient. That would probably work out just fine. You might
want to add more weight the greater the size of the shingle.
There are shingle filters in lucene/java/contrib/analyzers and there is a
Tanimoto distance in lucene/mahout/.
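
For reference, the shingle idea quoted above can be sketched in a few lines.
This is plain Python, not the Lucene shingle filters or the Mahout distance
class mentioned in the quote. The function names and the choice to weight
each shingle by its length are assumptions made for illustration only.

```python
from collections import Counter


def shingles(word, min_n=1, max_n=5):
    """Count character n-grams (shingles) of length min_n..max_n in word."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def weighted_tanimoto(a, b, min_n=1, max_n=5):
    """Tanimoto coefficient over weighted shingle count vectors.

    Each shingle count is multiplied by the shingle length, so longer
    shared shingles weigh more (this weighting is an assumption).
    """
    va = {s: c * len(s) for s, c in shingles(a, min_n, max_n).items()}
    vb = {s: c * len(s) for s, c in shingles(b, min_n, max_n).items()}
    dot = sum(w * vb[s] for s, w in va.items() if s in vb)
    norm_a = sum(w * w for w in va.values())
    norm_b = sum(w * w for w in vb.values())
    denom = norm_a + norm_b - dot
    # Two empty strings have no shingles at all; treat them as identical.
    return dot / denom if denom else 1.0
```

Identical strings score 1.0, strings with no characters in common score 0.0,
and near misses such as "shot" and "shit" land somewhere in between.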

would that apply to my case? tanimoto coefficient over shingles?


Not really, no.


     karl

Best,
On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:
On 4 Sep 2008, at 14:38, Cam Bazz wrote:


Hello,
This came up before, but if we were to make a swear word filter, string
edit distances are no good. For example, words like `shot` are confused
with `shit`. There is also a problem with words like hitchcock. Apparently
I need something like soundex or double metaphone. The thing is, these
are language specific, and I am not operating in English.
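
To make the edit distance problem concrete, here is a minimal sketch of the
classic Levenshtein distance in plain Python (the function name is just for
illustration). It shows that `shot` and `shit` are a single substitution
apart, so any threshold loose enough to catch misspellings also catches the
harmless word.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("shot", "shit"))  # 1: a single substitution
```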

I need a fuzzy-like curse word filter for Turkish, simply.

You probably need to make a large list of words. I would try to learn from
the users that do swear, perhaps even trust my users to report each other. I
would probably also look at storing in what context the word is used,
perhaps by adding the surrounding words (ngrams, shingles, markov chains).
Compare "go to hell" and "when hell freezes over". The first is rather
derogatory while the second doesn't have to be bad at all.
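
One possible way to capture that context, sketched in plain Python: collect
every word shingle of a fixed size that contains the reported word, so the
surrounding words can be stored next to the verdict. The function name, the
whitespace tokenization and the default window size of 3 are assumptions.

```python
def context_shingles(text, target, size=3):
    """Return each run of `size` consecutive words that contains target."""
    words = text.lower().split()
    found = []
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        if target in window:
            found.append(" ".join(window))
    return found


print(context_shingles("go to hell", "hell"))
# ['go to hell']
print(context_shingles("when hell freezes over", "hell"))
# ['when hell freezes', 'hell freezes over']
```

The two phrases produce different shingles around the same word, which is
what would let a filter tell them apart once the shingles are labeled.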

I'm thinking Hidden Markov Models and Neural Networks.


        karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


