let me rephrase the problem. I already have a set of bad words. I want to avoid people inputting typos of the bad words. for example 'shit' is banned, but someone may enter sh1t.
how can i flag those phonetically similar bad words to the marked bad words? Best. On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <[EMAIL PROTECTED]> wrote: > > 4 sep 2008 kl. 15.54 skrev Cam Bazz: > > yes, I already have a system for users reporting words. they fall on an >> operator screen and if operator approves, or if 3 other people marked it >> as >> curse, then it is filtered. >> in the other thread you wrote: >> >> I would create 1-5 ngram sized shingles and measure the distance using >>> >> Tanimoto coefficient. That would probably work out just fine. ?>You might >> want to add more weight the greater the size of the shingle. >> >>> >>> There are shingle filters in lucene/java/contrib/analyzers and there is a >>> >> Tanimoto distance in lucene/mahout/. >> >> would that apply to my case? tanimoto coefficient over shingles? >> > > Not really, no. > > > karl > > > > >> >> Best, >> >> >> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> >> wrote: >> >> >>> 4 sep 2008 kl. 14.38 skrev Cam Bazz: >>> >>> >>> Hello, >>> >>>> This came up before but - if we were to make a swear word filter, string >>>> edit distances are no good. for example words like `shot` is confused >>>> with >>>> `shit`. there is also problem with words like hitchcock. appearently i >>>> need >>>> something like soundex or double metaphone. the thing is - these are >>>> language specific, and i am not operating in english. >>>> >>>> I need a fuzzy like curse word filter for turkish, simply. >>>> >>>> >>> You probably need to make a large list of words. I would try to learn >>> from >>> the users that do swear, perhaps even trust my users to report each >>> other. I >>> would probably also look at storing in what context the word is used, >>> perhaps by adding the surrounding words (ngrams, shingles, markov >>> chains). >>> Compare "go to hell" and "when hell frezes over". The first is rather >>> derogatory while the second doen't have to be bad at all. >>> >>> I'm thinking Hidden Markov Models and Neural Networks. >>> >>> >>> karl >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>> For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >>> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >