On 10 October 2010 23:41, Michael Neale <[email protected]> wrote:
> I think you should clean room implement it (or reuse some old code of yours
> if it is safe to do so). From what I have seen of the algorithm - it isn't
> huge - and it would make sense to have it re-implemented. As an alternative
> - consider taking a look at the MVEL soundex code and rewriting that - and
> we will see if we can make it upstream.

I just re-implemented this according to the algorithm I found at
http://en.wikipedia.org/wiki/Soundex
I've also consulted a CPAN module to learn what the MVEL implementation
was intended to do, but that could not be determined (possibly due to
omissions or bugs).

> I would say it is just slightly
> neglected - its not well known that it lives there. Using the MVEL one was
> just opportunistic for drools.
> I didn't know that it could return null, that is bad. I guess if it is null
> - that would mean that you just do a literal case insensitive compare?

A correct implementation never returns null. For an empty word an
implementation might return null, but for our purpose "" would be
preferable.

> Also - AFAIK - soundex is only for english right?

Certainly.

> Is there an equivalent for other languages?

Soundex is coarse even for English. I've found the atrocious example that
the Soundex for "Britney Spears" is the same as for "bewährten Superzicke"
(~ "proven super-b*").
NYSIIS <http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System>
is supposed to be better.

For German, there is an equivalent: "Kölner Phonetik". It might make sense
to provide this for an operator "soundex[de]". (All of /M[ae][iy]e?r/ sound
alike in German, and all exist as proper names.) I have also found one link
to an implementation adapted for French.

Soundex is aimed at the pronunciation of proper names. There might be some
leeway for that even in a language like Hungarian, which is pronounced
exactly as written.

I think Drools should drop the MVEL version and go for a flexible approach,
possibly even something better than Soundex/NARA for English.

I'll research this some more, and report back before I commit anything ;-)
-W

> If so, perhaps having it in the drools codebase makes sense
> and opens the way for people to plug in their own soundex.

> On Mon, Oct 11, 2010 at 2:54 AM, Wolfgang Laun <[email protected]>
> wrote:
>>
>> The implementation of "soundslike" is broken in more than one respect.
>> The conversion of a word to a Soundex string is provided by
>> org.mvel2.util.Soundex.
>> (.) There are words where Soundex.soundex returns null, so that the
>> calling code, in Drools, crashes with a NPE.
>> (.) The algorithm implemented in Soundex is erroneous. I'm not sure which
>> Soundex algorithm it is supposed to implement, but it just doesn't meet the
>> basic requirements.
>>
>> I have implemented, correctly, the version for the National Archives and
>> Records Administration (NARA) rule set for the official implementation of
>> Soundex used by the U.S. Government.
>>
>> Do we wait for MVEL to correct this bug, or do we just replace it with a
>> correct implementation?
>>
>> Regards
>> Wolfgang
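
For readers following the thread, the American Soundex rules as described on
the Wikipedia page cited above (including the NARA treatment of H and W) can
be sketched roughly as below. This is only an illustration under stated
assumptions, not the code discussed for Drools or MVEL: the class name
SoundexSketch and the methods encode and soundsLike are hypothetical, the
sketch returns "" instead of null for input without letters, and it simply
skips anything outside A-Z. The fallback to a case-insensitive compare
follows the suggestion quoted earlier in the thread.

```java
import java.util.Locale;

/** Illustrative sketch of American Soundex (NARA rules); names are hypothetical. */
public final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that are not coded
    // (the vowels A, E, I, O, U plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    private SoundexSketch() {}

    /** Returns a 4-character code such as "R163", or "" if the word has no A-Z letter. */
    public static String encode(String word) {
        if (word == null) {
            return "";                       // assumption: never return null
        }
        String upper = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(4);
        char lastCode = '0';
        for (int i = 0; i < upper.length() && out.length() < 4; i++) {
            char c = upper.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue;                    // simplification: ignore everything but A-Z
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);               // the first letter is kept as-is
                lastCode = code;             // but its code still suppresses a repeat
            } else if (c == 'H' || c == 'W') {
                // H and W do not separate two letters with the same code
            } else if (code == '0') {
                lastCode = '0';              // a vowel separates equal codes
            } else {
                if (code != lastCode) {
                    out.append(code);
                }
                lastCode = code;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');                 // pad short codes with zeros
        }
        return out.toString();
    }

    /** Compares two words by code; falls back to a literal compare if no code exists. */
    public static boolean soundsLike(String a, String b) {
        String ca = encode(a);
        String cb = encode(b);
        if (ca.isEmpty() || cb.isEmpty()) {
            return a != null && a.equalsIgnoreCase(b);
        }
        return ca.equals(cb);
    }
}
```

With this sketch, SoundexSketch.encode("Robert") and
SoundexSketch.encode("Rupert") both give "R163", and "Ashcraft" gives "A261"
because the H does not separate S and C. Since spaces and non-ASCII letters
are skipped, the two phrases from the example above also collapse to the
same code when passed as whole strings, which illustrates how coarse the
method is.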
_______________________________________________
rules-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/rules-dev
