I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA.
On Thu, 4 Sep 2008 17:07:09 +0300, "Cam Bazz" <[EMAIL PROTECTED]> wrote: > let me rephrase the problem. I already have a set of bad words. I want to > avoid people inputting typos of the bad words. > for example 'shit' is banned, but someone may enter sh1t. > > how can i flag those phonetically similar bad words to the marked bad > words? > > Best. > > On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <[EMAIL PROTECTED]> wrote: > >> >> 4 sep 2008 kl. 15.54 skrev Cam Bazz: >> >> yes, I already have a system for users reporting words. they fall on an >>> operator screen and if operator approves, or if 3 other people marked > it >>> as >>> curse, then it is filtered. >>> in the other thread you wrote: >>> >>> I would create 1-5 ngram sized shingles and measure the distance using >>>> >>> Tanimoto coefficient. That would probably work out just fine. ?>You > might >>> want to add more weight the greater the size of the shingle. >>> >>>> >>>> There are shingle filters in lucene/java/contrib/analyzers and there > is a >>>> >>> Tanimoto distance in lucene/mahout/. >>> >>> would that apply to my case? tanimoto coefficient over shingles? >>> >> >> Not really, no. >> >> >> karl >> >> >> >> >>> >>> Best, >>> >>> >>> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> >>> wrote: >>> >>> >>>> 4 sep 2008 kl. 14.38 skrev Cam Bazz: >>>> >>>> >>>> Hello, >>>> >>>>> This came up before but - if we were to make a swear word filter, > string >>>>> edit distances are no good. for example words like `shot` is confused >>>>> with >>>>> `shit`. there is also problem with words like hitchcock. appearently > i >>>>> need >>>>> something like soundex or double metaphone. the thing is - these are >>>>> language specific, and i am not operating in english. >>>>> >>>>> I need a fuzzy like curse word filter for turkish, simply. >>>>> >>>>> >>>> You probably need to make a large list of words. I would try to learn >>>> from >>>> the users that do swear, perhaps even trust my users to report each >>>> other. I >>>> would probably also look at storing in what context the word is used, >>>> perhaps by adding the surrounding words (ngrams, shingles, markov >>>> chains). >>>> Compare "go to hell" and "when hell frezes over". The first is rather >>>> derogatory while the second doen't have to be bad at all. >>>> >>>> I'm thinking Hidden Markov Models and Neural Networks. >>>> >>>> >>>> karl >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>>> For additional commands, e-mail: [EMAIL PROTECTED] >>>> >>>> >>>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]