The only "better" solution I can think of is to map the characters to their 
non-accented equivalents. While I think it's important to state that the default 
Soundex implementation is for English words, it would be nice to accommodate words 
with accented characters.
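
A minimal sketch of that mapping, assuming Java 6 or later where java.text.Normalizer is available. The class and method names (AccentFolding, stripAccents) are hypothetical and not part of commons-codec. It decomposes each character and drops the combining marks, so it only helps for letters that have a canonical decomposition; characters such as "ø" or "ß" pass through unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of commons-codec.
final class AccentFolding {
    private AccentFolding() {}

    /**
     * Decomposes the input to NFD (base letter followed by combining marks)
     * and then removes the combining marks, leaving the base letters.
     */
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches any Unicode combining mark.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

The folded String could then be handed to any of the encoders, which keeps the accent handling in one place instead of in each codec.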

My bigger concern is that the behavior is inconsistent between Soundex, Metaphone, and 
DoubleMetaphone. Soundex will not throw an IllegalArgumentException, whereas Metaphone 
passes through the "bad" character. DoubleMetaphone has support for two accented 
characters, C with cedilla and N with tilde.

To the extent that I think the language codecs should be swappable components, it's a 
good idea for the support to be consistent. To that end, a given String should either 
make every codec throw an exception or make none of them throw.
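
One way to get that consistency would be a single shared check that every codec calls before encoding. The sketch below is only an illustration: CodecInput and requirePlainLetters are hypothetical names, and the "A-Z only" policy is an assumption rather than what any of the codecs actually does today.

```java
// Hypothetical shared guard for illustration only; not part of commons-codec.
final class CodecInput {
    private CodecInput() {}

    /**
     * Returns the input unchanged if every character is an ASCII letter;
     * otherwise throws IllegalArgumentException naming the offending character.
     */
    static String requirePlainLetters(String input) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean plain = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (!plain) {
                throw new IllegalArgumentException(
                    "Only plain letters are allowed, found '" + c + "' at index " + i);
            }
        }
        return input;
    }
}
```

If all three encoders went through the same guard, the same String would be accepted or rejected by all of them.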

Just my 2 cents.
 

-----Original Message-----
From: Gary Gregory [mailto:[EMAIL PROTECTED] 
Sent: Sunday, May 23, 2004 8:37 PM
To: Jakarta Commons Developers List
Subject: [codec] Soudex issue with accented character.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=29080

Currently, "ö" or "é" in a String causes Soundex to throw an 
ArrayIndexOutOfBoundsException.

We can either:

(1) Throw a better Exception, like IllegalArgumentException: Only 'plain' letters are 
allowed.

Or:

(2) Ignore unmapped characters. This would work for "ö" and "é" since vowels are 
ignored but this could cause bad encoding values for other chars like "ç".

AFAIK, you cannot ask if a character is a vowel or not.
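
There is no built-in vowel test in the JDK, but one can be approximated. The sketch below assumes Java 6 or later (java.text.Normalizer) and treats only the English vowels a, e, i, o, u as vowels; the class and method names are hypothetical. It looks at the base letter left after canonical decomposition, so "ö" and "é" count as vowels while "ç" does not.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of commons-codec.
final class Vowels {
    private Vowels() {}

    /**
     * Decomposes the character to NFD and checks whether the base letter
     * (the first char of the decomposition) is one of a, e, i, o, u.
     */
    static boolean isVowel(char c) {
        String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
        char base = Character.toLowerCase(decomposed.charAt(0));
        return "aeiou".indexOf(base) >= 0;
    }
}
```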

Thoughts?

Gary


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

